
Quality & Quantity 32: 229–245, 1998.

© 1998 Kluwer Academic Publishers. Printed in the Netherlands.

Goodness of Fit in Regression Analysis – R² and G² Reconsidered

CURT HAGQUIST1 & MAGNUS STENBECK2


1 Centre for Public Health Research, University of Karlstad, S-65188 Karlstad, Sweden; 2 Centre for
Epidemiology, National Board of Health and Welfare, S-106 30 Stockholm, Sweden

Abstract. There has been considerable debate on how important goodness of fit is as a tool in
regression analysis, especially with regard to the controversy on R 2 in linear regression. This article
reviews some of the arguments of this debate and its relationship to other goodness of fit measures.
It attempts to clarify the distinction between goodness of fit measures and other model evaluation
tools as well as the distinction between model test statistics and descriptive measures used to make
decisions on the agreement between models and data. It also argues that the utility of goodness of fit
measures depends on whether the analysis focuses on explaining the outcome (model orientation) or
explaining the effect(s) of some regressor(s) on the outcome (factor orientation).
In some situations a decisive goodness of fit test statistic exists and is a central tool in the analysis.
In other situations, where the goodness of fit measure is not a test statistic but a descriptive measure,
it can be used as a heuristic device along with other evidence whenever appropriate. The availability
of goodness of fit test statistics depends on whether the variability in the observations is restricted, as
in table analysis, or whether it is unrestricted, as in OLS and logistic regression on individual data.
Hence, G2 is a decisive tool for measuring goodness of fit, whereas R 2 and SEE are heuristic tools.

1. Introduction
In sciences where causal model building holds a central position there is often
a need to evaluate theoretical models empirically. In other contexts, such as in
measurement models, the quality of data needs to be evaluated against a theoretical
model. In both situations, this is done by analysing the agreement between the data
generated by the model (predicted data, fitted or expected values) and the collected
empirical data (observed data/values). The degree of agreement is a measure of the
goodness of fit between data and model.
The usefulness of evaluating the goodness of fit of models is not undisputed.
Opinions also differ as to how best to do this, in other words what specific statistical
measure should be used for the purpose. An intense discussion has been taking
place in recent years about measures of goodness of fit for structural equation
models with latent variables (Bollen and Long, 1993). Another example is the
debate about the usefulness of R 2 in linear regression analysis which took place
a few years ago among methodologists in the USA (see Achen, 1990; King, 1990;
Lewis-Beck and Skalaban, 1990).

In view of the fact that regression analysis is one of the most popular and widely
used methods for analysing survey data it is important that attention be paid to
goodness of fit not only in connection with ordinary/linear regression analysis but
also in relation to logistic regression analysis.
The purpose of this article is to highlight and discuss some key questions
concerning goodness of fit in both linear and logistic regression analysis. In the
article the term logistic regression will be used as a label for logit analysis based
on individual as well as table data.
The following topics are beyond the purpose of the article:

− questions of regression assumptions and diagnostics, as well as questions relating to the evaluation of regression coefficients, in other words other factors that should be considered in a comprehensive assessment of regression models,
− questions relating to whether models should be evaluated against empirical data at all, and
− the applicability of stepwise procedures in empirical model testing.

2. When and Why Does Goodness of Fit Apply?


Goodness of fit is important in model building, but it is not always important, or
even applicable, for all types of regression analyses. As a rule of thumb one may
say that goodness of fit is important when the model is focused (model orientation),
but less important when we focus on single effects (factor orientation). Model
oriented approaches try to answer questions like “what explains the outcome?”.
The model is also important when you seek answers to questions like “what mod-
ifies the effect of x on y?”. A model oriented approach requires the inclusion of
all systematic sources of variation on the outcome. If successful, the model ori-
ented approach will produce accurate predicted outcomes for individual cases. In
principle, if such a model was applied to the entire population, and if no measure-
ment error existed, there would be no unexplained variation left. Hence, the only
source of unexplained variability in such a model applied to sample observations
is randomness originating in random measurement error and random sampling.
In practice it is often not realistic to account for all the variation across individ-
uals. Very many independent variables would be needed, some of which are not
measured and others which may not even be known. In contrast, using categorical
data offers other solutions. This is because the assumption of homogeneity of the
observations within a cell with respect to their effect on the outcome is built into
the construction of the table. The possible dissimilarity in the outcome across those
cases is hidden in the table. Hence, the remaining variation should be explained in
its entirety.

Far from all regression analyses are model oriented. There is often no intention
to explain, nor to predict the outcome. King (1986), for example, maintains that
the purpose of regression analysis is simply to measure the effects of independent
variables on a dependent variable.
The approach in this situation is to focus on a subset of effects including possi-
ble confounders. Such a factor oriented approach does not attempt to explain all of
the variation, but rather seeks answers to the questions “how does x affect y?” and
“how is the effect of x on y modified by z?”. It is well known that such an analysis
only requires the simultaneous inclusion of all effects on the outcome which are
interrelated (the factor orientation).
In factor oriented linear or logistic regression on individual data goodness of fit
is not applicable. The fit of the model must be evaluated only against the specific
part of the variation which is relevant to the subset of effects of interest. An overall
test of the model fit is too general for this purpose and does not answer the right
question. However, another factor oriented approach is to make use of categorical
data in a contingency table. As the irrelevant variation hereby is removed from the
analysis an overall test of the model fit will answer the right question. Therefore, in
contingency table analysis the goodness of fit tests almost always apply, regardless
of the approach of the analysis.

3. Different Types of Measures for Evaluation of Model Fit

In the literature there are several more or less similar measures that are labeled or
described as goodness of fit statistics. In the following discussion of measures we
distinguish between absolute goodness of fit measures and relative fit measures.
We also make a distinction between test statistics and descriptive measures.
The distinctive feature of goodness of fit measures is that they assess the agree-
ment between the model predictions and the observed data. In contrast, relative fit
measures evaluate the difference between two restricted models in some unit of
deviation which is determined by the observations. The goodness of fit measures
always have as their point of departure the saturated model, i.e., a completely
unrestricted model which assigns one parameter to each observation or cell. The
relative fit measures have an arbitrarily or theoretically chosen baseline model as their
point of departure, e.g., the model of independence or some other model putting
constraints on the permissible range of the observations.
The distinctive feature of a test statistic is that it has a known probability distri-
bution under some specified condition. Therefore such a measure computed from
a sample can be evaluated in order to assess the probability that the statistic comes
from that probability distribution. This makes it possible to judge whether it is
likely that the specified condition is at hand. In practice, when the residual unex-
plained variation is judged random one can say that the model cannot be rejected
and hence "fits the data". When using a test statistic, it can be determined whether

Figure 1. Measures for evaluation of model fit in regression analysis – a typology.

it is likely that the unexplained variation comes from a random distribution and a
statistical judgement with respect to the goodness of fit of the model can be made.
The typology in Figure 1 identifies four possible combinations of the above
criteria:
1. Goodness of fit test statistics which are decisive tools for measuring goodness
of fit.
2. Descriptive goodness of fit measures which describe the fit in an absolute sense
but which cannot be used to make tests of model fit.
3. Relative fit test statistics which are used to make tests of pairs of models but
do not measure the goodness of fit.
4. Descriptive relative fit measures which meet neither of the two criteria for a useful tool.
Table I lists some common measures of fit in linear and logistic regression along
with labels and some procedures associated with them in a common software pack-
age, SPSS. In the following, we will use the above classification as our point of
departure for a discussion of the merits and drawbacks of the measures.

4. Goodness of Fit in Linear Regression


4.1. r²
A statistical measure that is often used for goodness of fit for linear regression
models is r 2 , the coefficient of determination. The closeness between a predicted
regression line and the observed data is expressed by r² as the proportion of variance explained. More specifically, r² measures what proportion of the total variation in the dependent variable is explained by the independent variable, or,
put yet another way, “. . . how much we are reducing prediction error, relative to
how much potential error there is” (Lewis-Beck and Skalaban, 1990: 169). When
Table I. Procedures and labels in SPSS for Windows for statistical procedures
used to evaluate linear and logistic regression models

Statistical measure         Procedure                    Label

R²                          Multiple linear regression   R-square
Adjusted R²                 Multiple linear regression   Adjusted R-square
SEE                         Multiple linear regression   Standard error
F-test                      Multiple linear regression   F
Partial F-test              Multiple linear regression   F-change
G² (L²)                     Logit log linear             Likelihood ratio chi-square statistic
−2 log L                    Logistic regression          −2 LL
−2 log (L0/L1)              Logistic regression          Model chi-square
Partial −2 log (L0/L1)      Logistic regression          Improvement
R² analog
Adjusted R² analog

the model contains two or more independent variables, R 2 is usually called the
“multiple coefficient of determination” and is then a measure of what proportion
of the total variation in the dependent variable is explained by the entire model, in
other words all the independent variables.
The formula for R 2 is just as easy to grasp as the substantial meaning of the
measure. The formula is written
R² = Σ(Ŷ − Ȳ)² / Σ(Y − Ȳ)²

where Y is the observed Y, Ŷ is the predicted Y and Ȳ is the mean of Y (Lewis-Beck, 1980: 52).
Since the numerator is the regression sum of squared deviations (RSS) and the
denominator is the total sum of squared deviations (TSS), the formula can also be
written R 2 = RSS/TSS (see, for instance, Lewis-Beck, 1980).
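The RSS/TSS decomposition can be verified numerically. A minimal sketch in Python, using a small hypothetical data set (the numbers are purely illustrative, not from the article):

```python
# Hypothetical toy data to illustrate R^2 = RSS/TSS for a simple OLS fit.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 7.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept for y = a + b*x.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

y_hat = [a + b * x for x in xs]

rss = sum((yh - y_bar) ** 2 for yh in y_hat)   # regression sum of squares
tss = sum((y - y_bar) ** 2 for y in ys)        # total sum of squares

r2 = rss / tss
print(round(r2, 4))  # -> 0.9846
```

The same value is obtained as 1 − ESS/TSS, where ESS is the residual sum of squares, which is the form most software reports.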

4.2. R2 AS A GOODNESS OF FIT MEASURE

It is important to emphasize that R 2 is not a measure of the effect of the model on


the outcome; rather it is a measure of the degree of agreement between the former
and the latter. A measure of the “effect” of an independent variable is given by
the regression coefficient (β). The distinction can also be expressed by saying that
R 2 mirrors the relationship between observed and predicted values, whereas the
β-coefficients mirror the relationship between the independent and the dependent
variables.

It seems to be more the rule than the exception that books on quantitative meth-
ods describe and refer to R 2 as a measure of goodness of fit (see, for instance,
Schroeder, Sjoquist and Stephan, 1986; Berry and Feldman, 1985; Lewis-Beck,
1980). The rationale for this is that R 2 is a summary measure of the agreement
between the observed and the predicted data. Nevertheless, R 2 has by some been
seen as ‘out of fashion’ in political science (Lewis-Beck and Skalaban, 1990). The
utility of R 2 as a measure of goodness of fit is highly controversial among method-
ologists. Whereas, for example, Lewis-Beck and Skalaban (1990), see R 2 “. . . as
an invaluable tool in quantitative political science . . . ” (p. 168) and unique for pre-
dicting a dependent variable, Christopher H. Achen (1990) asserts that explained
variance does not really mean anything and that “. . . R 2 becomes a meaningless
accident of the sample . . . ” (p. 183). For many social scientists, interpretations in
terms of explained variance have “. . . doubtful meaning but great rhetorical value”
(Achen, 1982: 59).

4.3. R2 AS A DESCRIPTIVE MEASURE

Since R 2 evaluates the agreement between the model and the observed data it is a
goodness of fit measure. But although the measure always has the same 0–1 range,
there is no way of knowing how much variance must be explained in order for the
fit to be good enough. R 2 does not have a known distribution when the residual
unexplained variation is random. Hence, it is not possible to test whether all the
systematic variation has been accounted for. Therefore, although R² evaluates the goodness of fit, it is not a decisive goodness of fit test statistic.
Achen (1990) regards R² as unusable “. . . for drawing inferences about causal strength or substantively meaningful goodness of fit” (p. 180). In Achen’s words, R²
is “. . . a purely descriptive quantity with little substantive content” (Achen, 1990:
173).

4.4. R2 AS A SAMPLE - SPECIFIC MEASURE

A critical property of R 2 is that the measure is “sample-specific” in the same way as


correlation coefficients are. The fact that R 2 is sample-specific means that its value
may differ greatly between different samples even when the “causal” relationship
between two variables is the same and all the estimated (unstandardised) regression
coefficients are identical. The reason is that the variance in the dependent variable
may differ between the different samples (Berry and Feldman, 1985). Because of
this property, Hanushek and Jackson (1977), for example, advise extreme caution
when interpreting R 2 , especially when comparing R2 from different samples.
The criticism directed at R 2 seems to relate particularly to the use of R 2 in
sciences where non-experimental designs are common. When the independent
variables are not manipulated, the variance is in practice determined by the sample
– unlike the case of experimental designs (Achen, 1982). More specifically, the

question concerns the distinction between unconditional and conditional regression. In the former case, a sample is taken randomly “. . . over a fixed population of
the independent variables” (Achen, 1990: 179), where the precondition is that the
independent variables are fixed to a given distribution regardless of which sample
is studied. In the latter case the sample is taken “ . . . conditional on the observed
independent variables” (Achen, 1990: 179). According to Achen, the point of the
distinction is that R 2 can only be a meaningful measure under unconditional re-
gression. However, the basis for this is weak in the social sciences, for example,
because the distribution of the independent variables is not fixed, neither in time
nor space (samples).
Under conditional regression, R 2 is a sample-specific measure, that is, a mea-
sure determined by the sampling process and the variability in the independent
variables: “. . . variance in a dependent variable has little to do with difficulty of
prediction. How much variance appears in a dependent variable is an accident of
sampling, and the same dependent variable with the same ease of prediction may
have large variance or a small variance, depending on how diverse the distribution
of independent variables happens to be” (Achen, 1990: 177).
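Achen’s point can be made concrete with a small simulation. In the hypothetical sketch below, two samples share the same “true” model y = 2x + e and even the same disturbances; only the spread of the independent variable differs. The estimated coefficients are identical, yet R² differs sharply:

```python
def ols_fit(xs, ys):
    """Simple OLS for y = a + b*x; returns (a, b, r2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    ess = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return a, b, 1 - ess / tss

# Same model y = 2x + e and identical disturbances, but a narrow
# versus a wide spread in the independent variable.
e = [1.0, -1.0, 0.0, -1.0, 1.0]
x_narrow = [-1.0, -0.5, 0.0, 0.5, 1.0]
x_wide = [-10.0, -5.0, 0.0, 5.0, 10.0]

y_narrow = [2 * x + err for x, err in zip(x_narrow, e)]
y_wide = [2 * x + err for x, err in zip(x_wide, e)]

a1, b1, r2_narrow = ols_fit(x_narrow, y_narrow)
a2, b2, r2_wide = ols_fit(x_wide, y_wide)

print(b1, b2)                                   # -> 2.0 2.0
print(round(r2_narrow, 3), round(r2_wide, 3))   # -> 0.714 0.996
```

The residuals, and hence SEE, are identical in the two samples; only the variance of the independent variable drives the difference in R².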

4.5. R2 AS A STANDARDISED MEASURE

What the critics of R 2 see as its Achilles heel is seen by its supporters as its
strength. In a defence of R 2 , Lewis-Beck and Skalaban (1990) took as their starting
point the need for a statistical measure to gauge the predictive capability of a model,
in other words, to be a measure of how well the dependent variable can be predicted
from knowledge of the independent variables. According to them, this calls for a
measure which not only measures absolute predictive capability but also relative
predictive capability. As they see it, R 2 has precisely the properties that absolute
prediction measures lack:

– R 2 has a fixed upper limit as well as a fixed lower one.


– R 2 can easily be evaluated and does not require access to other measures.
– R² provides a baseline – in the form of a scale where ‘1’ is perfect fit and ‘0’ is no fit at all – which makes it possible to judge the predictive capability of
a model. Since Lewis-Beck and Skalaban (1990) also regard R 2 as inherently
a baseline, they are of the opinion that “ . . . a model R 2 may be used as a base
to which estimates from rival models may be fruitfully compared” ( p. 159).

It is clear that the problems surrounding R 2 are only part of a broader complex of
problems relating to the use of standardised measures. In principle, the difference
between an absolute and a relative measure is the same as the difference between
unstandardised and standardised measures. The difference is that the spread of the
independent variables is taken into account in one case but not in the other.

Standardised regression coefficients, as well as R 2 , contain and reflect such


differences of variance. Put differently, both standardised regression coefficients
and R 2 are sample-specific, whereas unstandardised regression coefficients are not.
So on this point we agree with King when he states: “ . . . any argument that applies
to standardized and unstandardized coefficients should apply with equal weight to
standardized and unstandardized variability” (King 1990: 187).
The family relationship between the different measures is illustrated by the fact
that, under simple regression, the value of r² is equal to the square of the standardised beta-coefficient. However, the different choices carry different costs. The
price for choosing unstandardised regression coefficients instead of standardised
ones is not too high. Since standardised regression coefficients must be interpreted
as changes in the unit ‘standard deviations’ it is hard to get an intuitive understand-
ing of the meaning of the beta-coefficient. The price of abandoning correlation
coefficients may be somewhat higher, since they are easy to interpret intuitively.
It may prove harder for the model-building researcher to completely discard R 2 ,
owing to its superficially appealing interpretation possibilities.
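The identity between r² and the squared standardised coefficient in simple regression can be checked directly. A sketch with hypothetical numbers:

```python
import math

# Hypothetical simple-regression data (illustrative numbers).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 7.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                            # unstandardised slope
beta = b * math.sqrt(sxx / syy)          # standardised coefficient
r2 = sxy ** 2 / (sxx * syy)              # coefficient of determination

print(round(beta ** 2, 4), round(r2, 4))  # both -> 0.9846
```

Note that beta rescales the slope by the ratio of the standard deviations, which is exactly how the sample-specific spread of x enters the standardised measure.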

4.6. R2 MAXIMISATION

Warnings have been given about regarding the maximisation of R 2 as the aim of
regression analysis (Schroeder et al., 1986). Usually R 2 increases in value for
each independent variable that is added to a model; in any event it does not fall
(Schroeder et al., 1986; Berry and Feldman, 1985). This happens regardless of
whether the variable is relevant in the model or not. This property of R 2 has been
regarded as undesirable. However, R 2 shares this property with all other goodness
of fit measures.
The problem is that no good criterion exists on how high R 2 ought to be before
the fit is good or acceptable. An attempt to resolve the problem of making such a
decision is to ‘factor in’ the actual number of independent variables when calculating R². The adjusted R² obtained in this way can decrease when another variable is added to the model (Schroeder et al., 1986). But the adjusted R², no more than the unadjusted, can be given an objective interpretation pertaining to the overall fit of the model.
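The usual adjustment is R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of regressors. A sketch with hypothetical R² values shows how a near-useless added variable can raise R² while lowering the adjusted measure:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical figures: an extra regressor nudges R^2 from 0.50
# to 0.51, but the adjustment penalises the additional parameter.
n = 20
before = adjusted_r2(0.50, n, k=2)
after = adjusted_r2(0.51, n, k=3)

print(round(before, 3), round(after, 3))  # -> 0.441 0.418
```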
It is important not to be carried away by the chase for high R² values. With maximisation of R² as a goal it is easy to “lose” independent variables which are structurally relevant but which do not contribute to raising R² very much. The risk is especially great when ‘mindless’ procedures such as stepwise regression are used, i.e. when the choice of variables in the model is guided by a computer program rather than a theoretical idea.

4.7. SEE

Alongside R 2 , the textbooks also mention “SEE” or “se” for the evaluation of
goodness of fit for linear regression models. SEE is usually defined as “the standard
error of estimate of Y” (Lewis-Beck, 1980; Berry and Feldman, 1985), as “the standard error of the regression” (Achen, 1982: 62) or as “the estimated standard deviation of the actual Y from the predicted Y” (Lewis-Beck, 1980: 37). SEE
is an expression of the estimated standard deviation of the “disturbances” (Achen,
1982). SEE is a measure of how good the fit of a model is, expressed as an “average
prediction error” (Lewis-Beck, 1980). The intuitive or substantial interpretation of
SEE is that it expresses “. . . how far the average dependent variable departs from its
forecasted value” (Achen, 1982: 62). Numerically the interpretation is also simple:
the goodness of fit of a regression with a lower SEE is better than the goodness of
fit of a regression with a higher SEE. The formula for calculating SEE is
Se = √[ Σ(Yi − Ŷi)² / (n − 2) ]

(Lewis-Beck, 1980: 37). The lower limit of SEE is zero; it has no fixed upper limit.
Unlike R 2 , SEE is neither standardised nor sample specific. SEE is an estimate
of the average agreement between the predictions and the true population values of
Y . In other words, SEE is not affected by accidental differences in the variances of
the independent variables. It has the same unit of measurement as the dependent
variable. This permits comparisons of models across different samples.
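The computation is straightforward. A sketch using the same style of hypothetical toy data as above (n − 2 degrees of freedom, for the estimated slope and intercept):

```python
import math

# Hypothetical toy data for a simple regression.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 7.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# SEE: square root of the residual sum of squares over n - 2,
# expressed in the units of the dependent variable.
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
see = math.sqrt(residual_ss / (n - 2))

print(round(see, 4))  # -> 0.3162
```

Because SEE is in the units of y, it is directly comparable across samples drawn from the same population, unlike R².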
The main objection to SEE is that it is hard to interpret as an independent
measure, since a single SEE in itself does not indicate whether the fit is good
or bad. Lewis-Beck and Skalaban (1990) are of the opinion that “. . . SEE is not
self-sufficient as a measure of relative predictive capability” (p. 157), unlike R 2 ,
which has exactly the properties that SEE lacks (Lewis-Beck and Skalaban, 1990),
i.e., the common scale across realizations. The differences between R² and SEE as measures of goodness of fit are illustrated by the fact that different measures can lead to different results. A model with a lower SEE, and thus a better fit than another model in terms of SEE, may have a lower R² and thus a worse fit. This
inconsistency between the different measures when evaluating goodness of fit is
due to the fact that the variations in the independent variables are greater within
the sample with higher R 2 .

4.8. F- TESTS
It is possible to ‘convert’ R 2 into a test statistic. The ratio of the average explained
to the average residual variance is a statistic which has an F distribution if the
explanatory variables taken together do not contribute to an increase of the fit of a
baseline model by more than what would be expected by chance.

The reference data for the F -test is not the observed data but a summary
measure of them. To quote a statement from Hanushek and Jackson (1977) in
connection with R²: “. . . it is simply a comparison of the estimated systematic model with a very naïve model, namely the mean of the observed values of Y” (p. 58, italics added).
Hence, the mean of Y can be regarded as a model, namely the model of in-
dependence between all the joint X’es and the outcome Y . In fact, the F -test
can be used to test any pair of nested models. The choice of baseline model is
indeed the problem with this approach. The model of independence is no more
natural than any other baseline. As Hanushek and Jackson (1977) say, it is often ‘naïve’ to expect Y to be independent of all the X’es. The choice of comparison
becomes arbitrary or at least dependent on the situation. In this sense it differs from
a goodness of fit measure which has its natural baseline in the observed data. Albeit
a model test, the F -test is therefore not a goodness of fit test in the absolute sense.
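The conversion of R² into the F statistic is mechanical: F = (R²/k) / ((1 − R²)/(n − k − 1)), which under the null of no joint contribution follows an F(k, n − k − 1) distribution. A sketch with hypothetical figures (the R² value reuses the illustrative toy fit above):

```python
# F statistic from R^2: ratio of average explained to average
# residual variance, with k regressors and n observations.
def f_from_r2(r2, n, k):
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical example: R^2 = 12.8/13 from n = 4 cases, k = 1 regressor.
f = f_from_r2(12.8 / 13, n=4, k=1)
print(round(f, 1))  # -> 128.0
```

The statistic would then be compared with the tabulated critical value of F(1, 2); the comparison tests the model against the naïve mean-only baseline, not against the observed data.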

5. Goodness of Fit in Logistic Regression


The methodology literature has paid considerable attention to measures of good-
ness of fit, not only for linear regression analysis but also for logistic regression
analysis. There has apparently been no parallel to the R 2 controversy in the field
of logistic regression. In fact, much work has gone into finding statistical measures
comparable to R 2 for logistic regression analysis (e.g., McFadden, 1974). The lack
of any direct equivalent to R 2 is mentioned as one of the shortcomings of logit (and
probit) analyses (Hagle and Mitchell, 1992).

5.1. LIKELIHOOD RATIO STATISTICS

The likelihood ratio statistics play an important role in logistic regression analysis.
The expression may lead to confusion, since it is used for two different applications, one of which is a goodness of fit test statistic and one of which is not. The likelihood ratio goodness of fit test applies to contingency table analysis. It is referred
to by some as G2 (see, for instance, Bishop et al., 1975; Demaris, 1992; Fienberg,
1980; Gilbert, 1993) and by others as L2 (see, for instance, Clogg and Shihadeh,
1994; Knoke and Burke, 1980). In the terminology used in GLM a further expres-
sion occurs, “the deviance” (McCullagh and Nelder, 1989; Agresti, 1996). Another
application of the likelihood ratio statistics is a measure closely related to the F -test
in linear regression. It seems that this measure is usually called −2 log (L0/L1),
but it is sometimes referred to as the model chi-squared (Demaris, 1992; SPSS
1993).
The two measures use the same quantity, a measure of deviation. The maximum
likelihood method chooses parameter estimates on a predefined set of parameters
such that the likelihood of observing the sample data is maximized. This maximum
likelihood is used for assessing the model fit.

5.2. G2
The G2 test relates the loglikelihood value of a specific model to the loglikeli-
hood value for a completely unrestricted model, the saturated model. The saturated
model fits the observed data exactly. Hence, G2 measures the agreement between
the observations and the data generated by an unsaturated model. The formula for
G2 is
G² = 2 Σi Σj nij log(nij / m̂ij)

where nij are the observed frequencies and m̂ij are the expected frequencies in the ith row and the jth column (Agresti, 1990: 48). This statistic is chi square
distributed if the residual variation is random and if the expected number of obser-
vations in each cell of the contingency table is sufficiently large. Hence, under quite
general conditions it is possible to evaluate whether it is probable that the obtained
statistic comes from a chi squared distribution. If this hypothesis cannot be rejected,
the conclusion is that the overall model fit is acceptable. The distributions of the
chi squared statistics are not dependent on the sample but only on the number
of degrees of freedom of the model. Therefore, an objective evaluation of model
fit is possible. The point value of the statistic is partly due to random variability
depending on the specific sample at hand. Therefore, it is not possible to directly
compare chi squared statistics obtained from different samples. Nevertheless, G2
is a goodness of fit test statistic and therefore it plays a central role in the analysis
of table data.
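The computation can be sketched for the independence model on a hypothetical 2×2 table, where the expected frequencies are the products of the margins over the total:

```python
import math

# Hypothetical 2x2 table of observed counts.
observed = [[30, 10],
            [20, 40]]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

g2 = 0.0
for i, row in enumerate(observed):
    for j, n_ij in enumerate(row):
        m_ij = row_sums[i] * col_sums[j] / total  # independence model
        g2 += n_ij * math.log(n_ij / m_ij)
g2 *= 2

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(g2, 2), df)  # -> 17.26 1
```

With 1 degree of freedom the 5% critical value of the chi squared distribution is 3.84, so in this illustrative table the independence model would be rejected.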

5.3. m-ASYMPTOTIC VERSUS n-ASYMPTOTIC MEASURES

Similar to R² in linear regression, the −2 log likelihood obtained from the logistic regression is sample specific, but unlike R² it cannot be standardised. However, the key issues regarding this measure are its interpretation and how it can be used. In fact, −2 log (L1/LS), the difference between the specified model (L1) of interest and the saturated model (LS), would be a conceivable goodness of fit measure for logistic regression on individual data. Since −2 log L(LS) = 0, and the goodness of fit test statistic G² can be regarded as a special case of −2 log L where the “reference model” is the saturated model (LS) (Agresti, 1990), is it possible to interpret −2 log L(L1) as a goodness of fit test statistic? The answer is no. The
reason is that the G2 statistic is m-asymptotic, whereas −2 log L is n-asymptotic.
The parameter estimators of the logit model are normally distributed in large
samples (Wald, 1943 in Agresti, 1990). This is also true of the estimator of the
joint effect of the entire model. This means that the saturated model is a valid
baseline for a test if and only if large sample properties apply to it. In order for this
to happen, a minimum number of observations is required in each cell of the table.
Since the table size (m) is fixed when the sample size (n) increases, the saturated
model has a limited number of parameters. Collecting more observations increases

the sample size in each cell and for each parameter of the saturated model. Hence,
with a sufficient sample size, tests of model fit using the saturated model as the
baseline becomes possible.
In logistic regression with one or more continuous independent variables the
saturated model cannot fulfill the sample size requirements as the observed mul-
tivariate sample space increases along with the number of observations. In other
words, the size of the saturated model is not fixed but increases with n (Hosmer and
Lemeshow, 1989). The minimum number of observations in order for large sample
properties to apply can only be achieved if the model is restricted. A non-saturated
model is necessary. This rules out the saturated model as the baseline model. In
contrast, the intercept-only (L0) model has one parameter regardless of the sample
size, such that increasing the number of observations increases the statistical power
of the estimated intercept. Therefore, with a sufficient total number of observations
the comparison between L0 and L1 is a valid test with a known distribution, but
the comparison between LS and L1 is not.
A restricted baseline model must be used in order to meet the distributional
requirements. But any restriction on the baseline model rules out the observations
as the reference model. Hence, it is not possible to define a strict goodness of fit
test statistic for logistic regression with continuous independent variables.
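The valid L0-versus-L1 comparison can be illustrated without an iterative fit: with a single binary regressor the logistic ML solution has a closed form, since the fitted probabilities are simply the group proportions. A sketch with hypothetical counts:

```python
import math

def binom_loglik(n, successes, p):
    """Bernoulli log likelihood for n cases with the given success count."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Hypothetical data: x = 0 gives 3 successes in 10 cases,
# x = 1 gives 7 successes in 10 cases.
n0, s0 = 10, 3
n1, s1 = 10, 7

# L0: intercept-only model, one common probability for all cases.
p_pooled = (s0 + s1) / (n0 + n1)
log_l0 = binom_loglik(n0 + n1, s0 + s1, p_pooled)

# L1: model with the binary regressor; the ML fit reproduces the
# group-wise proportions exactly.
log_l1 = binom_loglik(n0, s0, s0 / n0) + binom_loglik(n1, s1, s1 / n1)

# Model chi-square -2 log (L0/L1), here with 1 degree of freedom
# for the single added parameter.
model_chi2 = -2 * (log_l0 - log_l1)
print(round(model_chi2, 3))  # -> 3.291
```

In this illustrative sample the statistic falls just below the 5% critical value of 3.84, so the regressor’s contribution would not be judged significant.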
To circumvent the difficulties with n-asymptotic distributions, grouping of data
has been suggested as a possible approach (see Hosmer and Lemeshow, 1989;
Agresti, 1996).
Hosmer and Lemeshow discuss two different approaches, both based on esti-
mated probabilities. They advocate the strategy to group data in a table defined by
percentiles of the estimated probabilities. Simulations indicate that the test statistic
approximates a chi-squared distribution when the model is correctly specified
(Hosmer and Lemeshow, 1989). However, this involves collapsing the continuous
variable such that the model differs from the original specification.
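A sketch of the grouping strategy (our own minimal implementation, assuming NumPy and SciPy; the decile construction and the df = g − 2 convention follow the form commonly attributed to Hosmer and Lemeshow):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, n_groups=10):
    """Group observations by percentiles of the estimated probabilities and
    compare observed with expected event counts within each group."""
    order = np.argsort(p_hat)
    y, p_hat = y[order], p_hat[order]
    groups = np.array_split(np.arange(len(y)), n_groups)
    stat = 0.0
    for g in groups:
        n_g = len(g)
        obs = y[g].sum()          # observed events in the group
        exp = p_hat[g].sum()      # expected events under the model
        pbar = exp / n_g
        stat += (obs - exp) ** 2 / (n_g * pbar * (1.0 - pbar))
    df = n_groups - 2             # reference distribution from simulations
    return stat, df, chi2.sf(stat, df)

# Illustrative data generated from a logistic model with a continuous x
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
p_true = 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x)))
y = rng.binomial(1, p_true)
stat, df, p_value = hosmer_lemeshow(y, p_true)
```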

5.4. RELATIVE FIT TESTS

Likelihood ratio tests can be used for other comparisons than with the saturated
model – as already indicated, other nested models can be compared with each other.
The difference between the G2 values for different models is chi-squared distributed when
all the added parameters for the added variables are equal to zero (Fienberg, 1980).
If the difference between the G2 values of the two models is non-significant, the
more parsimonious model can be chosen, without committing Type II errors.
In GLM parlance, the deviances of two models (deviance0 and deviance1) are
compared (Agresti, 1996). This comparison can also be expressed as
−2 log(L0/L1), in which case it is analogous to the partial F -test of linear
regression analysis. Like the F -test, −2 log(L0/L1) makes it possible to
test for the effect of individual parameters which are added to nested models, even
when one of the independent variables is continuous. Although it is a model test,
the −2 log(L0/L1) does not test the goodness of fit, i.e. it does not compare
the model with the observed data. Just like the F -test it is a test statistic but not
a goodness of fit measure. However, it can be used in the same way as an F -test
by comparing the likelihood values obtained from two hierarchically related
models. The difference is evaluated against the chi-squared distribution. The F and
−2 log (L0/L1) statistics are de facto almost identical; in fact the former can
“. . . be derived from the likelihood ratio principle” (Aldrich and Nelson, 1984: 89).
Both the F and the −2 log (L0/L1) tests are used to judge whether the joint
parameters of the model have any effect on the dependent variable. In other words
the −2 log(L0/L1) measures “. . . whether any of the predictors are linearly related
to the log odds of the event of interest” (Demaris 1992: 47). What is tested is the
hypothesis that all parameters except the intercept are equal to zero. The model
statistic can be evaluated against a chi-squared distribution (Demaris, 1992). This
test is sometimes described as a goodness of fit test (Aldrich and Nelson, 1984).
The different values of −2 log(L0/L1) across two logistic models can be eval-
uated with partial tests. This is analogous to partial tests of differences between F
statistics in linear regression.

5.5. R 2 “ANALOGS”

The −2 log L is not a standardised goodness of fit measure. There have been some
attempts to rescale it. For instance, McFadden (1974) suggested

D = (log L0 − log L1)/ log L0

which always falls between 0 and 1.


This is a special case of the measure sometimes called “R 2 -analog” (Knoke
and Burke, 1980), sometimes R 2 L (Hosmer and Lemeshow, 1989; Menard, 1995).
Like R 2 in linear regression analysis, this measure is also interpreted in terms of
explained variance. R 2 -analog/R 2 L can be used in table analysis as well as in
logistic regression analysis with continuous variables.
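In code the rescaling is a one-liner; the two log-likelihood values below are hypothetical stand-ins for an intercept-only and a fitted model:

```python
def mcfadden_d(loglik_0, loglik_1):
    """McFadden's D = (log L0 - log L1) / log L0, i.e. 1 - log L1 / log L0.
    Both log-likelihoods are negative and log L1 >= log L0, so 0 <= D < 1."""
    return (loglik_0 - loglik_1) / loglik_0

# Hypothetical log-likelihoods for L0 (intercept only) and L1 (fitted model)
d = mcfadden_d(-693.1, -540.7)
```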
There appears to be no consensus about the usefulness of R 2 type
measures (Aldrich and Nelson, 1984; Menard, 1995). Agresti (1990) shows that
the R 2 -analog confounds goodness of fit and strength of association, and that
if the baseline model (e.g., the intercept model) does not fit the data, the
measure will depend on the sample size such that the explained variance appears
to increase as the sample grows. However, Monte Carlo experiments in recent
years, in which R 2 -analog measures were compared with R 2 in OLS, have
produced favourable results for a few of the measures (Hagle and Mitchell, 1992).
The R 2 -analog has also been suggested as a way of solving the problems that
may arise when evaluating the goodness of fit of contingency table models in
large samples. With a large sample it is often easy to reject the null hypothesis
of no remaining systematic variability. Knoke and Burke (1980) argue that high
values on the R 2 -analog (>0.90) can be taken as evidence of model fit in spite of
model failure according to G2 . This opinion has been sharply criticized by Duncan
(personal communication, 1985) who regards it as an expression of “statisticism”.
The R 2 -analog is neither a model parameter, nor a test statistic, nor a measure of
effect, nor a goodness of fit measure. Hence, it has no clear substantive meaning. In
contrast, the distribution of G2 under a correct model does not depend on the
sample size, while its power to detect systematic deviation between model and
data increases as the sample size grows large.
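The sample-size behaviour of G2 can be checked directly. Since G2 = 2 Σ O log(O/E) and the expected counts under independence are proportional to the observed totals, multiplying every cell by a constant c leaves the ratios O/E unchanged and scales the statistic by c. A small numerical check (the table is invented for illustration):

```python
import numpy as np

def g2_independence(table):
    """G2 for the independence model in a two-way contingency table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return 2.0 * np.sum(table * np.log(table / expected))

table = np.array([[30.0, 10.0], [20.0, 40.0]])   # hypothetical counts
g2_small = g2_independence(table)
g2_large = g2_independence(10 * table)           # ten times the sample size

# Same cell proportions, ten times the data: G2 grows tenfold, so the power
# to reject a misspecified model increases with n.
```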

6. Summary and discussion


Initially we chose to define goodness of fit as the agreement between observed
data and data generated by a model. Goodness of fit in this sense is important
in regression analysis, but not in all situations. In linear and logistic regression
analysis the applicability of goodness of fit measures depends on the approach –
model oriented or factor oriented. In table analysis the goodness of fit measures are
always important independently of the scope of the analysis.
For the non-statistician it is often difficult to appreciate the importance of differ-
ent statistical tools in practical work. Our classification of measures aims at sorting
out decisive from heuristic devices in research work. In our opinion it
is important to distinguish goodness of fit measures from relative fit measures for
comparing nested constrained models. It is also important to distinguish between
test statistics and descriptive measures. Using our classification scheme on the
discussed measures leads to the following results:

• G2 is a goodness of fit test statistic. As such it is a decisive tool in table analysis.


• The R 2 , SEE, and −2 log L(L1) are descriptive goodness of fit measures.
They may be helpful for assessing model fit in a heuristic sense when such an
assessment is of interest.
• The F -test, partial F -test, −2 log (L0/L1) and the partitioned G2 are decisive
tools for nested model tests if it is plausible that the partial test is not biased,
i.e. when at least one of the models tested fits the data. This can be judged
when using G2 , but is more difficult for the other measures.
• Some R 2 -analogs are descriptive measures for the comparison of nested
models.

The conclusion is that in a strict sense goodness of fit can be tested only in
table analysis. This possibility has been called “. . . one of the attractions of cate-
gorical data . . . ” (Clogg and Shihadeh, 1994: 8). For linear regression analysis of
continuous data there are only heuristic tools for the evaluation of goodness of fit.
For logistic regression with continuous independent variables the situation may be
even worse.
For comparisons of fit between different nested models there are decisive tools
and valid tests in more situations, notably for linear regression and logistic regres-
sion of table as well as individual data. It is important to point out that this is only
true for the situation in which one of the models fits the data. The latter can only be
tested in the table analysis. In other contexts it is a matter of theoretical arguments.
However, even if logistic table analysis, in comparison with linear regression
analysis, offers better scope for testing and comparing models, G2 does not solve
the problem of wrongly-specified models. As a measure of goodness of fit, G2 is
excellent provided only that the table on which the analysis is based is correct. The
decision to analyze a specific contingency table rather than another is not objective
but theory-driven. The strength of the theory underlying the table construction here
becomes decisive in the same way as the decision on model specification in the case
of individual analysis. Both decisions are critical and both are untestable.
Continuous regressors are sometimes collapsed into categories in order to make
the tools of table analysis available. We find this practice highly inadvisable
unless it is supported by strong theoretical motivations.
The lack of goodness of fit tests in individual level regression is not a coinci-
dence. It reflects a basic contradiction between the requirements of goodness of fit
measures and those of statistical tests. A goodness of fit measure is supposed to
compare the observations with predictions derived from a model. The goal is to
assess whether the predictions are sufficiently close to the original data. In
our opinion, the saturated model is therefore the only possible baseline model
for a goodness of fit measure.
A test statistic, on the other hand, is supposed to distinguish between random
and systematic variability. Applied to goodness of fit, the test is used to deter-
mine whether there is some systematic variability left in the data that the tested
model failed to account for. In order for the test to meet the necessary statistical
assumptions the number of observations across the multivariate sample space must
be large enough for the baseline model. But in individual level regression there is
only one observation per combination of x-values. Therefore, the goodness of fit
measure does not have a known distribution under randomness. To achieve large
sample properties for the test one must use an unsaturated model as the baseline
model (such as in the hierarchical F -test and the −2 log(L0/L1) test), or one must
group the data into classes (e.g., percentiles, as in the test proposed by Hosmer and
Lemeshow (1989)). But the resulting tests are not goodness of fit tests in the strict
sense of the word. In other words, both criteria cannot be fulfilled simultaneously.
If one requirement is met, the other must be sacrificed.
Other conclusions with respect to fit measures used in linear and logistic
regression are
• R 2 , F , −2 log(L0/L1) and G2 have scales which are independent of the scale
of Y , whereas SEE and −2 log L do not.
• R 2 , F , −2 log L, −2 log(L0/L1) and G2 are sample specific measures,
whereas SEE is an estimate of a property of the population.
Sample specific and standardised measures like R 2 must always be interpreted
in the light of the variance differences that may occur. Hence, their values cannot
be compared across samples. R 2 may be a useful tool for comparisons between
different models in the same sample. Partial F -tests can decide whether or not the
extra parameters added to a model increase the fit significantly, but cannot give
decisive answers with respect to overall model fit.
Non-standardised and non-sample specific measures like the SEE give addi-
tional information in model evaluation, and can be used for heuristic comparisons
of models within and across samples for the same outcome variable. However,
no objective criterion for how large an error is acceptable exists here
either.
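The contrast between the two heuristic measures can be illustrated with a minimal OLS sketch (invented data; SEE computed as the residual standard error, in the units of Y):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=n)   # illustrative data

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
sse = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)

r2 = 1.0 - sse / sst          # standardised and sample specific: cannot be
                              # compared across samples with different Var(Y)
see = np.sqrt(sse / (n - 2))  # in the units of Y; heuristically comparable
                              # across samples for the same outcome variable
```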
Some final remarks remain:
• In our view, the conclusions drawn from the debate about goodness of fit for
structural equation models, and the recommendations which emerged from that
debate – if possible, not to rely on a single measure of goodness of fit but on
several, and not only to report overall fit but also to evaluate individual
components such as parameter estimates (Bollen and Long, 1993) – are generally
applicable and valid for regression analyses as well.
• Whatever the researcher’s approach and whatever goodness of fit measure is
chosen, the results of the regression analysis must be interpreted subject to
an important proviso, namely that the model or the table has been correctly
specified. This is one of the basic assumptions on which regression analysis
rests, and the basic theoretical underpinnings of empirical work are only par-
tially testable regardless of whether one uses contingency tables or individual
observations as the point of departure.

References
Achen, C. H. (1982). Interpreting and Using Regression. Newbury Park: Sage Publications.
Achen, C. H. (1990). What Does “Explained Variance” Explain?: Reply, Political Analysis, 2. Ann
Arbor: The University of Michigan Press, pp. 173–184.
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: John Wiley and Sons.
Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley and Sons.
Aldrich, J. H. & Nelson, F. D. (1984). Linear Probability, Logit, and Probit Models. Newbury Park:
Sage Publications.
Berry, W. D. & Feldman, S. (1985). Multiple Regression in Practice. Newbury Park: Sage
Publications.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete Multivariate Analysis. Theory
and Practice. Cambridge: The MIT Press.
Bollen, K. A. & Long, J. S. (1993). Introduction. In: K. A. Bollen & J. S. Long (eds), Testing
Structural Equation Models. Newbury Park: Sage Publications.
Clogg, C. C. & Shihadeh, E. S. (1994). Statistical Models for Ordinal Variables. Thousand Oaks:
Sage Publications.
Demaris, A. (1992). Logit Modeling. Practical Applications. Newbury Park: Sage Publications.
Duncan, O. D. (1985). Personal letter to David Burke.
Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data. Cambridge: The MIT
Press.
Gilbert, N. (1993). Analyzing Tabular Data. Loglinear and Logistic Models for Social Researchers.
London: UCL Press.
Hagle, T. M. & Mitchell, G. E. (1992). Goodness of fit measures for probit and logit. American
Journal of Political Science 36: 762–784.
Hanushek, E. A. & Jackson, J. E. (1977). Statistical Methods for Social Scientists. Orlando:
Academic Press.
Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley and
Sons.
King, G. (1986). How not to lie with statistics: Avoiding common mistakes in quantitative political
science. American Journal of Political Science 30: 666–687.
King, G. (1990). Stochastic Variation: A Comment on Lewis-Beck and Skalaban’s “The R-Squared”.
Political Analysis, 2. Ann Arbor: The University of Michigan Press, pp. 185–200.
Knoke, D. & Burke, P. J. (1980). Log-Linear Models. Newbury Park: Sage Publications.
Lewis-Beck, M. S. (1980). Applied Regression. An Introduction. Newbury Park: Sage Publications.
Lewis-Beck, M. S. & Skalaban, A. (1990). The R-Squared: Some Straight Talk. Political Analysis,
2. Ann Arbor: The University of Michigan Press, pp. 153–171.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. London: Chapman and Hall.
McFadden, D. (1974). Conditional Logit Analysis of Qualitative Choice Behavior. Frontiers of
Econometrics. New York: Academic Press, pp. 105–142.
Menard, S. (1995). Applied Logistic Regression Analysis. Thousand Oaks: Sage Publications.
Schroeder, L. D., Sjoquist, D. L., & Stephan, P. E. (1986). Understanding Regression Analysis. An
Introductory Guide. Newbury Park: Sage Publications.
SPSS (1993). SPSS for Windows. Advanced Statistics Release 6.0. Chicago: SPSS.
SPSS (1993). SPSS for Windows. Base System User’s Guide. Release 6.0. Chicago: SPSS.
SPSS (1994). SPSS 6.1 for Windows update. Chicago: SPSS.
