
Introduction

Overfitting a model is a condition where a statistical model begins to describe the random error in
the data rather than the relationships between variables. This problem occurs when the model is
too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. In this post, I explain why overfitting a model is a problem and how you can identify and avoid it.

Overfit regression models have too many terms for the number of observations. When this occurs,
the regression coefficients represent the noise rather than the genuine relationships in
the population.

That’s problematic by itself. However, there is another problem. Each sample has its own unique
quirks. Consequently, a regression model that becomes tailor-made to fit the random quirks of one
sample is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model
reduces its generalizability outside the original dataset.

Taken together, an overfit regression model describes the noise, and it isn't applicable outside the sample. That's not very helpful, right? I'd really like these problems to sink in because overfitting often occurs when analysts chase a high R-squared. In fact, inflated R-squared values are a symptom of overfit models! Despite the misleading results, it can be difficult for analysts to give up that nice high R-squared value.

How It Occurs

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of
the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too
well. Specifically, overfitting occurs if the model or algorithm shows low bias but high
variance. Overfitting is often a result of an excessively complicated model, and it can be
prevented by fitting multiple models and using validation or cross-validation to compare their
predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the
underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm
does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm
shows low variance but high bias. Underfitting is often a result of an excessively simple model.
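
To make the contrast concrete, here is a minimal sketch of both failure modes using simulated data and scikit-learn (my choice of tools, not something from the original analysis): a straight line underfits a curved trend, while a degree-15 polynomial overfits it.

```python
# Sketch of under- vs overfitting with polynomial regression.
# Data is simulated; scikit-learn is assumed to be available.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)  # true trend + noise
X = x.reshape(-1, 1)

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"degree {degree:2d}: train R^2 = {train_r2:.2f}, CV R^2 = {cv_r2:.2f}")
```

Training R-squared climbs with every added term, but cross-validated R-squared peaks at a moderate degree and then collapses, which is exactly the overfitting signature described above.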

Both overfitting and underfitting lead to poor predictions on new data sets. In my experience with statistics and machine learning, I don't encounter underfitting very often. Data sets used for predictive modelling nowadays often come with too many predictors, not too few. Nonetheless, when building any predictive model, use validation or cross-validation to assess predictive accuracy, whether you are trying to avoid overfitting or underfitting. In regression analysis, overfitting is a real problem: an overfit model can cause the regression coefficients, p-values, and R-squared to be misleading.

An overfit model is one that is too complicated for your data set. When this happens, the regression
model becomes tailored to fit the quirks and random noise in your specific sample rather than
reflecting the overall population. If you drew another sample, it would have its own quirks, and
your original overfit model would not likely fit the new data.

Instead, we want our model to approximate the true model for the entire population. It should fit not only the current sample, but new samples too.

A fitted line plot makes the dangers of overfitting concrete: the model appears to explain a lot of variation in the response variable, yet it is too complex for the sample data, and in the overall population there is no real relationship between the predictor and the response.

How to Detect and Avoid Overfit Models

Cross-validation can detect overfit models by partitioning your data and determining how well your model generalizes to the held-out portions. This process helps you assess how well the model fits new observations that weren't used in the model estimation process.
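
As a sketch of what that partitioning looks like in code (assuming scikit-learn and numpy; X and y are stand-ins for your own regression data), you can compare training R-squared to held-out R-squared across the folds; a large gap is the telltale sign of a tailored model.

```python
# K-fold check of how well a model generalizes beyond its sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_gap(model, X, y, k=5):
    """Mean train R^2 minus mean held-out R^2 across k folds.
    A large positive gap suggests the model is tailored to its sample."""
    train_scores, test_scores = [], []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        train_scores.append(model.score(X[train_idx], y[train_idx]))
        test_scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(train_scores) - np.mean(test_scores)

# e.g., cv_gap(LinearRegression(), X, y) near zero is reassuring
```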

Minitab statistical software provides a great cross-validation solution for linear models by
calculating predicted R-squared. This statistic is a form of cross-validation that doesn't require you
to collect a separate sample. Instead, Minitab calculates predicted R-squared by systematically
removing each observation from the data set, estimating the regression equation, and determining
how well the model predicts the removed observation.

If the model does a poor job at predicting the removed observations, this indicates that the model
is probably tailored to the specific data points that are included in the sample and not generalizable
outside the sample.
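
If you are not using Minitab, the same leave-one-out idea is easy to sketch by hand. The following is an illustrative implementation in Python (assuming numpy and scikit-learn, with X a 2-D array of predictors), not Minitab's own code: drop each observation, refit, predict it back, and accumulate the PRESS statistic.

```python
# Predicted R^2 via explicit leave-one-out (PRESS-based).
import numpy as np
from sklearn.linear_model import LinearRegression

def predicted_r2(X, y):
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i           # leave observation i out
        model = LinearRegression().fit(X[mask], y[mask])
        resid = y[i] - model.predict(X[i:i + 1])[0]
        press += resid ** 2
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1 - press / ss_total
```

Unlike ordinary R-squared, predicted R-squared can be negative, and a value far below the ordinary R-squared is a strong sign the model is overfit.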

How to Avoid

To avoid overfitting your model in the first place, collect a sample that is large enough so you can
safely include all of the predictors, interaction effects, and polynomial terms that your response
variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables and the model you are likely to specify, and use that information to estimate a good sample size.
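
As a rough illustration, a common rule of thumb (my assumption here, not a rule stated in this post) is to plan for roughly 10 to 15 observations per model term.

```python
# Back-of-the-envelope sample-size check using a 10-15
# observations-per-term rule of thumb (an assumption, not a law).
def min_sample_size(n_terms, obs_per_term=15):
    """n_terms counts every predictor, interaction, and polynomial term."""
    return n_terms * obs_per_term

# e.g., 3 predictors + 2 interactions + 1 squared term = 6 terms
print(min_sample_size(6))  # 90 observations
```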
Conclusion

The optimal model usually needs verification on bigger or completely new datasets. There are, however, methods such as the minimum spanning tree or the life-time of correlation that exploit the dependence between correlation coefficients and the time-series window width. When the window width is big enough, the correlation coefficients are stable and no longer depend on the window size. A correlation matrix can therefore be created by calculating a coefficient of correlation between the investigated variables. This matrix can be represented topologically as a complex network where direct and indirect influences between variables are visualized.
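
As an illustrative sketch of that idea (assuming numpy and networkx, with simulated data standing in for real time series), you can turn a correlation matrix into distances and extract the minimum spanning tree as the network's backbone.

```python
# Correlation matrix -> distance matrix -> minimum spanning tree.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))           # 200 observations of 5 variables
corr = np.corrcoef(data, rowvar=False)     # 5x5 correlation matrix
dist = np.sqrt(2 * (1 - corr))             # Mantegna's correlation distance

G = nx.Graph()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        G.add_edge(i, j, weight=dist[i, j])

mst = nx.minimum_spanning_tree(G)          # backbone of strongest dependencies
print(sorted(mst.edges(data="weight")))
```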
