
Binary Logistic Regression Using SPSS

We consider an example given in Meyers, Gamst & Guarino (2006), Applied Multivariate Research: Design and Interpretation, by
Sage Publications. In this example, we want to predict psychiatric commitment to a state hospital (yes, no) based on
predictors minority status (yes, no), educational level, and stress test. The data, once entered into SPSS, should look like
this:

The total data set consists of 354 cases, of which the first 25 are shown above. The dependent or response variable is
"committed," while the predictors are educ, minority, and stress. Notice that both educ and stress can be regarded as
continuous variables, whereas minority is a dichotomously scored variable. Logistic regression can handle both types of
predictors. However, the response variable in a binary logistic regression must be dichotomous. We confirm that the variable
"committed" is dichotomously scored. If the response variable was not dichotomous, there is nothing preventing you from
forcing it into two categories (such as splitting groups at the median). However, you would have to also ask yourself why you
would want to do this. Ordinarily, if you have a continuous response, then you usually have a more sensitive measure of the
variable you are assessing, and it would usually (not always) be unwise to "chunk" that information into two coarse
categories. For instance, if you have data on IQ and the range is 50 to 150, there would be little reason for chunking that
information into two categories (low IQ, high IQ) and running a logistic regression instead of an ANOVA or regression or
other variants of the linear model.

Running the Logistic Regression

The first step to running the logistic regression is to select "Regression," then "Binary Logistic." We choose "Binary
Logistic" because our response variable has only two categories. Had it more than two categories, "Multinomial Logistic"
would have been an option. And if it had many, many "categories," such that we could theoretically consider the variable
continuous, least-squares regression would usually be a better option than multinomial logistic. Pay notice to this
"transition"fromwhatisconsideredadiscretevariablevs.whatisconsideredacontinuousvariable.
Once the "Logistic" window opens, you'll want to move over the "Committed to Hospital" (i.e., "committed") variable to
"Dependent," and "Educational Level," "Minority Classification" and "Stress Level" variables to "Covariates." SPSS calls
these variables "covariates," but you can also think of them as simply "predictors," as one would in ordinary least-squares
regression analysis. All predictors in virtually every model you run can be regarded as covariates (because they presumably covary with the dependent variable), though the actual term "covariate" is often reserved for nuisance variables included in analysis of covariance (ANCOVA) models. In SPSS's logistic regression procedure, however, these predictors are labeled covariates.

Since "minority" is a dichotomously-scored variable, you can select the tab "Categorical" and enter that variable there.
However, since it is a dichotomous variable and has only two categories, you can leave it under "Covariates." If there were 3
categories or more, then putting it under the "Categorical" tab would be necessary so SPSS could produce the relevant dummy-
coding scheme for it. With only two categories though, it's just fine under "Covariates." This is because if you have only
two categories for a categorical variable, you will still only have a single vector (column) of scores for it. In other
words, SPSS does not have to work with the variable to produce a new set of vectors to represent the categorical variable.
However, if you had 3 or more categories, then typically you would need to invoke a dummy-coding scheme, which SPSS does
automatically if you specify the variable as "categorical" under the tab above.

Next, select the "Save" tab, and for the sake of thoroughness and completeness, let's check off everything available. We
discuss most of these things in the ensuing output:

Next, select "Options" from the "Logistic Regression" window, and check off everything in the window. Again, we discuss what
these things mean in the output, and we'll refer back to these windows when discussing the results.
Once you've hit "Continue" you're ready to run the logistic regression. Select "Ok" in the "Logistic Regression" window, and
SPSS will run the analysis.

Interpreting the Output

The first bit of output will be the syntax used to execute the commands you specified through windows. Recall that you will
only get this syntax as output if you've configured SPSS to provide this for you. Here's the syntax for the present analysis:

LOGISTIC REGRESSION committed
  /METHOD=ENTER educ minority stress
  /SAVE=PRED PGROUP COOK LEVER DFBETA RESID LRESID SRESID ZRESID DEV
  /CLASSPLOT /CASEWISE OUTLIER(2)
  /PRINT=GOODFIT CORR ITER(1) CI(95)
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

We won't go through the syntax now, but as we interpret the output below, you might want to check back and forth with the
syntax to identify the commands with the relevant output. You'll easily match things up with the commands above, since the
syntax contains the actual commands SPSS uses to produce the output. The syntax is actually quite straightforward once you
understand the commands.

Now, onto the first real piece of output from our analysis:

SPSS first provides a basic descriptives window showing how many cases were analyzed, and how many cases were missing. As can
be seen above, all 354 cases in this analysis were used, with zero cases missing, and none unselected. What's the
"unselected" about? Sometimes in analyses we wish to analyze only some cases and omit others. For instance, in one analysis
you may be interested in only analyzing females and not males. If that were the case, SPSS would report to you cases that
were "unselected." If there were 40 males in your data, SPSS would report unselected cases equal to 40. SPSS also provides
the relevant percentages as well. In the present example, 100% of cases were analyzed.

Next, SPSS tells us how the dependent variable was "encoded." What this simply means is that it is telling us what values you
used to name the categories on the dichotomously scored DV. In the current example, the response "No" was coded "0" and the
response "Yes" was coded "1." Did you have to code like this? Absolutely not. You could have coded "No" equal to "1" and
"Yes" equal to "0." What you call a "success" in binomial lingo is entirely up to you. However, usually we're more interested
in the "Yes" response than we are the "No" response, as we are in the current example. For instance, if you're modeling
survival, you usually want to model those who survive as equal to 1, and those who do not equal to 0. What would you rather
model? Would you rather model the number committed to psychiatric treatment, or would you rather model the number not
committed? It usually is more intuitive to model the number committed, and so as Meyers et al., do in this example, they
choose "Yes" as the "success" in this case. As a general rule, whenever you do a logistic analysis, ask yourself which value
of the dependent variable is of most interest to you, and code that value equal to 1.0. This won't apply in all cases, but
usually this is a good rule to follow. Coding this way also facilitates the interpretation of parameter estimates, which will be discussed shortly; it makes their interpretation more intuitive than if you coded the DV differently.

Next, SPSS reports that it begins analyses at "Block 0." This is the first analysis and it contains NO predictors. It
contains only the constant in the model and therefore is not of much use to us. We're more interested in the predictive
power of the predictors entered rather than a model with only an intercept term. However, it doesn't mean the above output
should be completely disregarded. Let's consider what it's actually telling us. Notice on the left-hand side, it reports that
SPSS has conducted two "iterations" at step 0. What this means is that the algorithm used to compute the estimates for the
constant term took two "rounds" (iterations) to settle on an estimate that it deemed satisfactory. What's "satisfactory"?
SPSS stops "iterating" when there is such a small change between iterations as to be negligible. That's what it's telling us
in note "c" above, "Estimation terminated at iteration number 2 because parameter estimates changed by less than .001." That
is, at iteration 1, the algorithm estimated a constant term of -.169. At iteration 2, it "refined" the estimate into a
constant term of -.170. Since the change from iteration 1 to 2 was so small (less than .001), SPSS stopped "iterating" and
settled on the coefficient of -.170. In other words, it deemed the estimation process complete for a model with only the
constant term included, and reported its final value of -.170.
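
As a quick check on this value (a minimal Python sketch, not part of the SPSS output): in a constant-only logistic model, the estimated constant is simply the log odds of the observed outcome split, which here is 162 "Yes" cases against 192 "No" cases (these frequencies appear in the classification table discussed below).

import math

# Constant-only logistic model: the constant equals the log odds of the
# observed outcome split (162 committed vs. 192 not committed).
n_yes, n_no = 162, 192
constant = math.log(n_yes / n_no)
print(f"{constant:.3f}")   # -0.170, matching the SPSS estimate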

What is the "-2 Log likelihood" all about? This statistic is read exactly as it appears, "negative 2 multiplied by the Log
likelihood." Without going into much detail here, the log likelihood is a statistic used in estimation, kind of like chi-
squared statistics are used in other contexts. The -2 log likelihood value of 488.203 is not telling us much on its own. However, as noted by Meyers et al., it serves nicely as a "baseline" value against which we can compare other models (e.g., models that actually contain our predictors of interest). So, let's remember this value of 488.203, and observe what happens in "Block 1"
when we include predictors into our model. If those predictors actually have predictive value, the "-2 Log likelihood" should
DROP (not go up).

Next up, SPSS reports a classification table for the dependent variable based on how well the model does with only the
constant included. So, how well does it do? Not very well. Notice that with the constant-only model, the predictive power of
the logistic regression is perfect for those not "committed to hospital," with 192 cases correctly classified, but horrible
for those "yes" committed to hospital. For those 162 cases, the constant-only logistic regression correctly classified 0% of
them. Overall classification for the model is equal to 192/354 = 54.2%. What this means is that the model doesn't do a good
job at classifying cases, which actually is expected at this stage of the analysis since we only have the constant term
included in the model. That is, we've yet to use our predictors to aid in classification, we've yet to use our predictors to
sharpen our predictive power. So, don't be discouraged if your constant-only model doesn't do a good job; it's usually not supposed to.

Next, SPSS reports the estimation parameters associated with step 0 (i.e., the "constant-only" model). Notice that SPSS is
telling us that it's reporting on the "Step 0 - Constant" model. The "B" is equal to -.170 (which matches up with the value of -.170 seen earlier for iteration 2 under "iteration history"). What is this "B"? We'll interpret it when we have predictors
in the model, and you'll learn exactly what it means. The "S.E." is the standard error for the "B," (i.e., how much sample to
sample fluctuation in the estimated parameter we can expect in the long run if we were able to take an infinite number of
samples). The "Wald" is an inferential statistic which, as noted by Meyers et al., is similar in concept to the t-test. It's a
statistic calculated in this case on 1 degree of freedom (df = 1). When we evaluate this statistic, we find it not
statistically significant at p < .05. The probability of observing a statistic of 2.536 or more extreme is equal to .111, so
about 11% of the time in repeated sampling will we see such a statistic from its respective sampling distribution. That's
actually not very often, and notice that we're nearing the 5% level, but we're probably not close enough to deem this
particular constant-only model "important" by any stretch. The "Exp(B)" we'll interpret when we consider more predictors. In
sum, this window of "Variables in the Equation" is telling us that not much is going on with the constant-only model, and
that coupled with the classification table, the model isn't very useful.
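
If you want to reproduce that p-value yourself, here is a minimal Python sketch (assuming the scipy library is available; this is not something SPSS requires you to do) that refers the Wald statistic to a chi-square distribution with 1 degree of freedom:

from scipy.stats import chi2

# p-value for the constant-only Wald statistic of 2.536 on 1 df
wald = 2.536
p_value = chi2.sf(wald, df=1)   # survival function: P(chi-square > 2.536)
print(f"{p_value:.3f}")         # approximately .111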

The next piece of output provided by SPSS is that of "Variables not in the Equation." This information is kind of
"superfluous" to include at this point, and merely tells us that had we included "education" and "stress" into the model,
they would both contribute to predicting the dependent variable. Notice that each shows a reported p-value of .000 (though the display is rounded; a p-value is never exactly equal to 0), while "minority" is not statistically significant (p = .643). Why do I say this is somewhat "superfluous" at this point? Because the output that follows below that
we're going to look at will basically tell us the same information. That is, we're going to test models with these predictors
in them anyway, so the above information at most is a "window to the future" of what will follow shortly in our output. At
minimum, it more or less tells us that these variables may well be meaningful to us, and that we shouldn't stop interpreting the logistic regression at the intercept-only phase. We should definitely push on.

Let's consider these variables further with the first true block of our analyses. SPSS will report it with the information
below, and it will note that we've chosen the method of "Enter" which means we're entering all of our variables instead of
using a stepwise or hierarchical format:

SPSS then reports again an "Iteration History," but this time, it does so with predictor variables "education," "minority,"
and "stress" included in the model. Conceptually however, the algorithm is doing a similar thing that it did at block 0, it's
trying to converge on estimates that it deems are "narrow" enough to stop the iteration process. Look at the "-2 Log
likelihood value," and how it gets smaller as the number of steps increases. It goes from 388.707 to 363.373, stopping at
step 6. The coefficients for "constant," "education," "minority" and "stress" are best estimated at step 6, where SPSS deemed
that the difference between values at step 5 and step 6 was less than .001 (as noted in note "d"). Actually, it looks like
the estimate didn't change at all from step 5 to step 6, but that's simply because of the number of decimal places SPSS is
reporting. If you want to see the differences between the values, double-click on the output box, then double-click on values
of the coefficients. For instance, let's double-click on the 3.626 under "stress" at step 5. This is what we see:

Similarly, if you double-click on the value of 3.626 for step 6, you'll see:

Notice that we do have a difference between steps 5 and 6 after all, and we can see this when we force SPSS to provide us
with more decimal places. Keep in mind that all of this iteration history isn't that important in terms of evaluating your
model, but it should be checked each time you do an analysis to ensure that the algorithm has performed correctly. Sometimes,
when modeling extremely messy data with many missing values, the algorithm will not converge, and you will receive an error
message (or the software will keep trying to converge indefinitely). If you check your iteration history, you'll spot any
serious problems of this kind before going on to interpret your actual output. If the iteration does not converge on a value,
and you see an error message, you should not automatically proceed to interpret the output (seek help if the iterations do
not converge). Okay, enough of iteration history; let's get on to the actual analysis with predictors in the model.

SPSS next reports the "Omnibus Tests of Model Coefficients." "Omnibus" means "overall," and so this output is simply telling
us whether the model with 3 predictors (education, minority, stress) predicts the dependent variable better than chance
alone. In a very crude sense, the above table is telling us whether this model is worth looking into further (again, I said
"very crude sense," since there may be times when you'll want to look at the model further regardless of the above output).
SPSS designed the above table for cases in which you are performing a stepwise or hierarchical logistic regression, which is
why you see "Step" and then "Block". Had we been performing a stepwise or hierarchical logistic regression, we would along
the way see differences between the "Step" "Block" and "Model" numbers. But for this example, in which we used the method
"enter," we see the same values. What the above is telling us is that the model with 3 predictors does better than chance at
predicting the dependent variable, and is statistically significant at p < .001. How is the chi-square value of 124.829
computed? It is computed by taking the difference between the -2 log likelihood at block 0 and the -2 log likelihood at block 1, and as such is equal to 488.203 - 363.373 = 124.83.
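
If you would like to verify this likelihood-ratio arithmetic outside of SPSS, here is a minimal Python sketch (scipy assumed) using the two -2 log likelihood values reported above and 3 degrees of freedom, one per predictor:

from scipy.stats import chi2

# Omnibus (likelihood-ratio) test: difference between the two -2 log likelihoods
neg2ll_block0 = 488.203              # constant-only model
neg2ll_block1 = 363.373              # model with educ, minority, and stress
chi_square = neg2ll_block0 - neg2ll_block1    # 124.83
p_value = chi2.sf(chi_square, df=3)           # 3 predictors added
print(round(chi_square, 2), p_value)          # 124.83, p far below .001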

Next, we come to the "Model Summary," which presents summary statistics for the model at "Step 1," which, recall, is the model with 3
predictors. The first statistic should look familiar, it is the -2 Log likelihood value, and is equal to 363.373. Compare
this value to that of the "constant-only" model. That -2 Log likelihood value was equal to 488.203. So, we see that with the
inclusion of the 3 predictors, the -2 Log likelihood value has DECREASED (the difference being 488.203-363.373 = 124.83, the
value of chi-squared above). This is a good thing; we want it to decrease. For fit statistics such as these, lower values are generally what we want. In structural equation modeling, for instance, we typically want the value of chi-squared to be low for good-fitting models. Similarly in this case, we want the -2 log likelihood value to be smaller at Step 1 than at Step 0,
because if it is, then it's reflective of a potential gain in model fit. Again, SPSS is proud to tell us its iteration
history (but haven't we had enough of that already?), informing us that it terminated at iteration 6. We already knew that
from previous output though.

Next is the "Cox & Snell R Square" value of .297. This statistic is referred to as a "pseudo-R" statistic, in that it is
designed to tell us something similar to what R-squared tells us in ordinary least-squares regression, that of the proportion
of variance accounted for in the dependent variable based on the predictive power of the independent variables (predictors)
in the model. However, it should never be interpreted exactly as one would interpret R-squared in OLS (ordinary least-
squares) regression. That's why we call it "pseudo," and at best, it's an approximation to telling us something similar to R-
square. Overall, high values are better than low values here, with higher values suggesting that your model fits increasingly
well. Next is the "Nagelkerke R Square" statistic, and like the "Cox & Snell R Square," it is also a "pseudo" R-square value,
purporting to tell us something along the lines of an OLS R-square, but not directly comparable to it. Cohen, Cohen, West &
Aiken (2003) actually call these statistics "Multiple R-squared Analogs" to emphasize that they are not equivalent to the R-
squared in OLS regression. As they note, ". . . we caution that all these indices are not goodness of fit indices in the sense of 'proportion of variance accounted for,' in contrast to R-squared in OLS regression" (p. 503). For further details on this issue, see Cohen et al., pp. 502-504. One thing that you need to know is that the Nagelkerke R-square will always be at least as large as the Cox & Snell R-square, since the Nagelkerke R-square is an adjustment of the Cox & Snell that divides it by its maximum attainable value, so that the adjusted statistic can reach 1.0. The maximum value the Cox & Snell can attain is less than 1.0 (roughly 0.75 for data like these). Cohen et al. suggest that you report this issue if you interpret the Nagelkerke R-square in your research, since it's tempting to simply report the larger value (.397 looks more impressive than .297, but this is simply a matter of scaling, not an actual "size" difference). Again, see Cohen et al., pp. 502-504 for details. At minimum, be sure not to equate these values one-to-one with R-squared values in OLS regression.
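
Both pseudo R-square values can be reproduced from the -2 log likelihoods reported above, which also makes the rescaling relationship between them concrete. Here is a minimal Python sketch using the standard Cox & Snell and Nagelkerke formulas (purely for illustration; SPSS computes these for you):

import math

n = 354
neg2ll_0, neg2ll_1 = 488.203, 363.373    # constant-only and 3-predictor models

# Cox & Snell R-square: 1 - (L0/L1)^(2/n), written in terms of -2 log likelihoods
cox_snell = 1 - math.exp((neg2ll_1 - neg2ll_0) / n)      # about .297

# Maximum attainable Cox & Snell for these data, and the Nagelkerke rescaling
max_cox_snell = 1 - math.exp(-neg2ll_0 / n)              # about .748
nagelkerke = cox_snell / max_cox_snell                   # about .397

print(round(cox_snell, 3), round(max_cox_snell, 3), round(nagelkerke, 3))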

Next up we come across the "Hosmer and Lemeshow Test":

The "Hosmer and Lemeshow Test" is a measure of fit which evaluates the goodness of fit between predicted and observed
probabilities in classifying on the DV. Similar to the -2 log likelihood, we want this chi-squared value to be low and not statistically significant, which is what we will see if the predicted and observed probabilities match up nicely. In this case, however, we see that
the test is statistically significant (p < .001), suggesting that the probabilities of predicted vs. observed do not match up
as nicely as we would like. SPSS reports the details of the Hosmer and Lemeshow Test below. Notice that observations have
been divided up into 10 groups. These 10 groups were defined based on predicted probabilities. What is important to note here
is the agreement or lack of agreement between observed and expected frequencies. Notice that the agreement isn't that great.
For instance, in partition 6 for "Committed to Hospital = No," the observed is equal to 24, while the expected is equal to
19.943, not a good match at all. Similarly for other partitions, the "fit" between observed and expected is not great, which
is one reason why the result of the Hosmer and Lemeshow test was statistically significant. Recall that a chi-squared test
will be statistically significant when the observed frequencies deviate from the expected frequencies (watch out that
statistical significance isn't being influenced by sample size alone). The p < .001 in this case suggests that the observed
frequencies deviate from the expected, a result we do not wish to see. What we would like to see here is a match between both
frequencies, because that would suggest the predicted probabilities are lined up with the observed probabilities, which would
suggest a well-fitting model. In sum, the Hosmer and Lemeshow Test in this case suggests the model fit isn't as good as we
would like.
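
For the curious, the Hosmer and Lemeshow statistic is essentially a Pearson-type chi-square summed over both columns of the 10-group table, evaluated on (number of groups - 2) degrees of freedom. Below is a rough Python sketch (scipy assumed); the observed and expected arrays would need to be filled in from the SPSS contingency table, and only the partition-6 "No" cell quoted above (24 observed vs. 19.943 expected) is used for illustration:

from scipy.stats import chi2

def hosmer_lemeshow(observed, expected, n_groups=10):
    """Sum of (O - E)^2 / E over every cell of the decile table; df = groups - 2."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2.sf(stat, df=n_groups - 2)

# Contribution of a single cell from the table discussed above
# (partition 6, "Committed to Hospital = No"):
print(round((24 - 19.943) ** 2 / 19.943, 2))   # about 0.83
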
Next, SPSS reports one of the more important parts of the logistic regression output - the classification table based on
using all predictors in the model. Recall the earlier classification table that used only the constant in the model. The
percentage correctly classified was equal to 54.2, and recall that it did a very poor job of classifying those who were committed to hospital (i.e., "Committed to Hospital = Yes"). However, it was more or less expected that the model would do a rather
poor job at classification, since we hadn't entered our predictors yet. But now we have them in the model, so let's have a
look at the improvement in classification:

What do you think? How does the classification look now? As can be seen from the classification table, the overall percentage
of those correctly classified is equal to 78.0%, a big increase from the constant-only model correctly classifying 54.2%. We
can inspect the table for some specifics now: of those who were not committed to hospital, the model correctly classifies 174 out of 192 cases (the marginal total for "No"), or 90.6%. Where did I get the 192 from? It's from adding 174 to 18, the row total for "Committed to Hospital = No." Now, how does the model do at predicting those who were committed to hospital? It correctly
classifies 102 out of 162, or 102/162 = 63% of cases. How was the "Overall Percentage" of 78.0 figure calculated? It was
calculated by summing up the "hits" (174 and 102) and dividing over the total number of cases. The "hits" consist of the sum
of "No-No" and "Yes-Yes" frequencies. In the above table, there are 174 hits for "No-No," and 102 hits for "Yes-Yes," giving
us 174 + 102 = 276 hits. When we divide that by the total frequency of 354, the resulting percentage correctly classified is
276/354 = 78.0% (it's actually 77.97%, but SPSS is rounding up). The note at the bottom of the table regarding the cut-off
value being .500 simply states that when the classification table was produced by SPSS, it needed a way to group the
resulting frequencies in the correct cells. For instance, if the predicted probability of commitment for a given case was equal to, say, .48, that case was classified under "Committed = No." Similarly, if the predicted probability for a given case was equal to, say, .55, that case was classified under "Committed = Yes." The cut value
of .500 is typical for this type of analysis, but you should also know that it could be changed if we had good theoretical
reason for changing it. So in sum, our model with 3 predictors correctly classifies 78.0% of cases.
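
The arithmetic behind this table is easy to reproduce. Here is a small Python sketch using the four cell counts discussed above (174 and 18 in the observed "No" row; 60 and 102 in the observed "Yes" row, where 60 = 162 - 102); the labels "specificity" and "sensitivity" are the usual names for the row-wise hit rates and are my own addition, not SPSS output:

# Cell counts from the Block 1 classification table (cut value = .500)
no_no, no_yes = 174, 18      # observed "No": correctly vs. incorrectly classified
yes_no, yes_yes = 60, 102    # observed "Yes": incorrectly vs. correctly classified

total = no_no + no_yes + yes_no + yes_yes       # 354
overall = (no_no + yes_yes) / total             # 276 / 354
specificity = no_no / (no_no + no_yes)          # 174 / 192
sensitivity = yes_yes / (yes_no + yes_yes)      # 102 / 162

print(f"{overall:.1%} {specificity:.1%} {sensitivity:.1%}")   # 78.0% 90.6% 63.0%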

Next up we have "Variables in the Equation," and this time, we see that all three predictors are included. The above table is
a major part of the logistic regression output, and in addition to interpreting previous fit statistics, you will need to
know how to interpret the above table very well to assess the evidence in your data. Let's take a close look at what
everything means. First, we see that SPSS is reporting the Step number, which is "1," and noting that the variables in this
step include education, minority and stress. Let's look at the coefficient "B" under "education." It is equal to -.187. What
does it mean? It means that for a one-unit increase in education, we can expect the log odds (or "logit") of being committed to hospital to decrease by .187, controlling for minority and stress. Now what does that mean? Recall that when performing a
logistic regression, the actual thing we are modeling is the log of the odds, called the "logit." The logit is equal to the
natural log of the odds. Interpreting something as the log of the odds isn't terribly enlightening, which is why we will
convert the logit to a probability shortly, and interpret the probability figure instead. However, what is key right now is
to realize that the logit is on a continuous scale, and not a categorical one. The logit can range from negative infinity to
positive infinity, which means that the coefficient -.187 represents the expected change in a continuous quantity (the
logit), and not a dichotomous response variable. The dichotomous response variable of "yes vs. no hospital" was converted to
a continuous logit, which made the logistic regression possible. Again, we'll convert the logit to a probability shortly, but
for now, it is enough to note that the coefficients under "B" represent changes in logits, and logits can be any "slice" of
values on the real number line. An easy way to conceptualize this is that in OLS regression, typically the response variable
is measured on a continuous scale. In logistic regression, we transform the categorical information into a continuous scale.
What is the unit of measurement? It's the logit, the log of the odds.
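
To make the logit scale concrete, here is a tiny Python sketch (purely illustrative) showing that a probability of .5 corresponds to a logit of 0, while probabilities near 0 and 1 map to logits far out on the negative and positive ends of the real line:

import math

def logit(p):
    """Natural log of the odds, ln(p / (1 - p))."""
    return math.log(p / (1 - p))

for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(p, round(logit(p), 2))   # -4.6, -1.1, 0.0, 1.1, 4.6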

Next, we see "S.E." which stands for "standard error." The standard error of an estimated coefficient should be familiar from
previous work on linear regression and even t-tests and z-tests. Essentially, the standard error is a measure of how stable
our estimate is. A large standard error means the estimated coefficient isn't that well estimated, and a low standard error
means we have a fairly precise estimate. For "education," the standard error is equal to .056. The Wald statistic, as noted
earlier, is very much like a t-statistic conceptually, and is a test of the null hypothesis that the "B" population
coefficient is equal to 0. Do we have good reason to reject the null hypothesis? Based on the p-value of .001, we have
evidence to suggest that the "B" coefficient is not equal to 0 in the population from which these data were presumably drawn.
That is, we have evidence to suggest that "education," after controlling for "minority" and "stress," predicts the response
variable better than chance alone. Note carefully that the coefficient of -.187 can only be interpreted in the context of the
inclusion of minority and stress. Similar to OLS regression, you cannot interpret the bivariate relationship between a
predictor and a response variable if you have additional variables in the model. Quite literally, in the above model, we are
not testing the relationship between any single predictor and the response variable. If you wanted to do this, then you'd
have to test each term separately in its own bivariate model.

Next, we see "Exp(B)," and for education, the value is equal to .829. The value is computed by exponentiating the coefficient "B." We take the value e, a constant approximately equal to 2.718, and raise it to the power "B." So, we have e^(-.187), equal to .829. The number ".829" has a very special meaning. It is an odds ratio and is interpreted as follows: a 1-unit increase in education multiplies the odds of being committed to hospital by .829 (that is, it lowers the odds), controlling for minority and stress. The 95%
confidence interval is also provided for the value of Exp(B). We interpret it as follows: if we drew repeated samples from this population and constructed a 95% confidence interval in each, about 95% of those intervals would capture the true value of Exp(B), and about 5% would not; the interval computed from our sample runs from .743 to .926. Note carefully that it is the interval that is the random element in all this, not the actual parameter. The parameter we assume to be fixed.
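
All of the quantities just discussed for "education" can be reproduced from the reported B of -.187 and standard error of .056. Here is a minimal Python sketch (scipy assumed; 1.96 is the usual normal critical value for a 95% interval):

import math
from scipy.stats import chi2

b, se = -0.187, 0.056               # coefficient and standard error for education

wald = (b / se) ** 2                # about 11.15
p_value = chi2.sf(wald, df=1)       # about .001, as reported by SPSS

odds_ratio = math.exp(b)            # Exp(B), about .829
ci_lower = math.exp(b - 1.96 * se)  # about .743
ci_upper = math.exp(b + 1.96 * se)  # about .926

print(round(wald, 2), round(p_value, 3))
print(round(odds_ratio, 3), round(ci_lower, 3), round(ci_upper, 3))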

The "Exp(B)" value is not very intuitive, unless you're very familiar with interpreting odds. It is usually much more natural
to interpret probabilities (which have a range of 0 to 1). Luckily, we can convert the odds to probabilities, and have SPSS
produce these for us. Recall that under the "Logistic Regression: Save" window, we checked off "Predicted Values -
Probabilities". SPSS computes the predicted probabilities in the main SPSS Editor:

For case number 1 (highlighted in green), with an education score of 16, minority equal to 0, stress score of 3.06, and the
response variable score of 1, the predicted probability is equal to .98377. How was this obtained? It was obtained by a
simple transformation of the odds. First, let's compute the logistic regression equation for case 1 using the coefficients
derived from the above model:

logit (hospitalization) = -3.985 + 3.626(stress) + .915(minority) + (-.187)(education)


=-3.985 + 3.626(3.06) + .915(0) + (-.187)(16)
=-3.985 + 11.09556 + 0 - 2.992
=4.11856

The value of 4.11856 is NOT a probability, and therefore we should not expect it to agree with the value of PRE_1 above
of .98377. The value of 4.11856 is a LOGIT (the natural log of the odds). To transform the logit to a probability, we first
exponentiate the logit: e^4.11856 = 61.471. The number 61.471 is called the "odds." To get the probability, convert the odds
to a probability by the following transformation:

probability = odds/(1 + odds) = 61.471/(1 + 61.471) = 0.984

Notice that the value of 0.984 agrees with the value reported by SPSS under "PRE_1" which stands for "predicted probability."
Let's do case 2:

logit (hospitalization) = -3.985 + 3.626(stress) + .915(minority) + (-.187)(education)


=-3.985 + 3.626(1.67) + .915(1) + (-.187)(16)
=-3.985 + 6.05542 + .915 - 2.992
=-0.00658

e^-0.00658 = 0.993
probability = odds/(1 + odds) = 0.993/(1 + 0.993) = 0.498

The value of 0.498, within rounding error, agrees with the value reported by SPSS for case 2. It is the predicted probability
of being hospitalized for this particular case, given the inclusion of other variables in the model. This is how the actual
predicted probabilities in a logistic regression are computed, and we could continue the process for each case in our
analysis. Recall that the objective of a statistical model is to reproduce the observed data through the parameters you specify to be estimated. Whether it is a GLM or a binary logistic regression, we want to reproduce the observed data as well as we can, and to the extent that a model is able to reproduce the observed data, it is regarded as well-fitting.
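
The same hand calculation is easy to script. Here is a minimal Python sketch (for illustration, not SPSS syntax) using the coefficients from the model above, applied to cases 1 and 2:

import math

def predicted_prob(educ, minority, stress):
    """Predicted probability of commitment from the fitted coefficients above."""
    logit = -3.985 - 0.187 * educ + 0.915 * minority + 3.626 * stress
    odds = math.exp(logit)
    return odds / (1 + odds)

print(round(predicted_prob(16, 0, 3.06), 3))   # case 1: about .984
print(round(predicted_prob(16, 1, 1.67), 3))   # case 2: about .498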

The next logit is:

logit (hospitalization) = -3.985 + 3.626(stress) + .915(minority) + (-.187)(education)


=-3.985 + 3.626(1) + .915(1) + (-.187)(15)
=-3.985 + 3.626 + .915 - 2.805
=-2.249

e^-2.249 = 0.1055
probability = odds/(1 + odds) = 0.1055/(1+ 0.1055) = 0.095

Let's command SPSS to compute the logits for each observation:

COMPUTE logit = -3.985-.187*educ +.915*minority +3.626*stress.


EXECUTE.

The logits for the first 10 observations follow:

4.11
-.02
-2.25
-.14
.26
-.64
.38
1.37
-1.82
-.15

How do we go from logits to probabilities? As indicated above, we exponentiate the logit, then use this exponentiation to
calculate the probability by odds/(1 + odds). Let's command SPSS to do this:

COMPUTE prob = (2.71**logit)/(1+2.71**logit). [where ** indicates exponentiation, and 2.71 is approximately equal to e]
EXECUTE.

.9836
.4953
.0960
.4646
.5645
.3462
.5938
.7965
.1404
.4616

Notice that within rounding error, we end up with the same probabilities listed in the above SPSS output window.
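
If you prefer to do the final conversion outside SPSS, the same transformation in Python (using the exact constant e via math.exp rather than the rounded 2.71) applied to the ten logits listed above yields essentially the same probabilities:

import math

logits = [4.11, -0.02, -2.25, -0.14, 0.26, -0.64, 0.38, 1.37, -1.82, -0.15]
probs = [math.exp(x) / (1 + math.exp(x)) for x in logits]
print([round(p, 4) for p in probs])   # agrees with the SPSS values within rounding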

Linearity of the Logit

[section forthcoming]

Matrix of Predictors

SPSS then provides the requested correlation matrix of predictors that we specified in the window commands:

There isn't much to look at in this matrix, except to note that there are no bivariate correlations that are extremely high (e.g., .8 or .9 and higher). This helps alleviate concerns about multicollinearity. SPSS then reports a rather important table:
To be continued . . .

References & Readings

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. London: Sage Publications.

DATA & DECISION, Copyright 2010, Daniel J. Denis, Ph.D. Department of Psychology, University of Montana. Contact Daniel J. Denis by e-mail daniel.denis@umontana.edu.
