CHAPTER 3A
Data Screening
Once your data, whether derived from survey, experimental, or archival methods, are in hand and have been entered into SPSS, you must resist the temptation to plunge ahead with sophisticated multivariate statistical analyses without first critically examining the quality of the data you have collected. This chapter will focus on some of the most salient issues facing researchers before they embark on their multivariate journey. Just as a summer vacationer proceeds in a purposeful way by stopping the mail and newspaper for a period of time, confirming reservations, and checking to see that sufficient funds are available to pay for all the fun, so too must the researcher take equally crucial precautions before proceeding with the data analysis. Some of these statistical considerations and precautions take the following form:
- Do the data accurately reflect the responses made by the participants of my study?
- Are all the data in place and accounted for, or are some of the data absent or missing? Is there a pattern to the missing data?
- Are there any unusual or extreme responses present in the data set that may distort my understanding of the phenomena under study?
- Do these data meet the statistical assumptions that underlie the multivariate technique I will be using?
- What can I do if some of the statistical assumptions turn out to be violated?
Suppose, for example, that the GAF scores in our data set average about 55, with minimum and maximum scores of approximately 35 and 65, respectively. If during the cleaning process we discover a respondent with a GAF score of 2, a logically legitimate but certainly unusual score, we would probably want to verify its authenticity through other sources. For example, we might want to take a look at the original questionnaire or the actual computer-based archival record for that individual. Such a cleaning process leads us to several options for future action. Consider the situation in which we find a very low GAF score for an individual. Under one scenario, we may confirm that the value of 2 is correct and leave it alone for the time being. Or, after confirming its veridicality, we might consider that data point to be a candidate for elimination on the proposition that it is an outlier or extreme score because we view the case as not being representative of the target population under study. On the other hand, if our investigation shows the recorded value to be incorrect, we would substitute the correct value (e.g., 42) in its stead. Last, if we deem the value to be wrong but do not have an appropriate replacement, we can treat it as a missing value (by either coding directly in SPSS the value of 2 as missing or, as is more frequently done, by replacing it with a value that we have already specified in SPSS to stand for a missing value).
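Although we carry out these steps in SPSS throughout the book, the logic is easy to sketch in a few lines of Python/pandas. The data frame, column name, and case positions below are hypothetical illustrations, not the actual data set:

    import pandas as pd
    import numpy as np

    # Hypothetical GAF scores for six cases; the case in position 2
    # carries the suspect value of 2.
    df = pd.DataFrame({"gaf": [55, 48, 2, 61, 39, 57]})

    # Flag logically legitimate but unusual scores for manual verification.
    print(df[df["gaf"] < 20])

    # Option 1: the score is verified as correct -- leave it alone.
    # Option 2: the score is a recording error and the true value is known:
    df.loc[2, "gaf"] = 42
    # Option 3 (instead of Option 2): the score is wrong and no replacement
    # exists -- set it to missing (NaN plays the role of an SPSS missing value):
    # df.loc[2, "gaf"] = np.nan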
Distribution Diagnosis
With small data sets containing just a few cases, data cleaning can be accomplished through simple visual inspection. However, with the typically large data sets required for most multivariate analyses, computerized packages such as SPSS provide a more efficient means of screening the data before the statistical analysis begins. These procedures produce output that displays how the data are distributed. We will discuss six types of output commonly used for this purpose: frequency tables, histograms, bar graphs, stem-and-leaf displays, box plots, and scatterplot matrices. In Chapter 3B, we will demonstrate how to conduct these analyses with SPSS.
Frequency Tables
A frequency table is a convenient way to summarize the obtained values of variables that contain a small number of different values or attributes. Demographic variables such as gender (with two codes), ethnicity (with between half a dozen and a dozen codes), and marital status (usually with no more than about six categories), along with other nominal-level variables, including questions that require simple, dichotomous yes/no answers, all lend themselves to summary in a frequency table.
Table 3a.1  Terminal Degree Status of Fifty-One Community Mental Health Center Providers

Respondent  Degree      Respondent  Degree
 1          MA          27          HS
 2          MA          28          MA
 3          HS          29          MA
 4          MA          30          HS
 5          HS          31          MA
 6          DOC         32          BA
 7          MA          33          MA
 8          DOC         34          BA
 9          MA          35          MA
10          MA          36          BA
11          MA          37          MA
12          HS          38          MA
13          DOC         39          DOC
14          MA          40          MA
15          DOC         41          BA
16          6           42          MA
17          MA          43          MA
18          HS          44          DOC
19          DOC         45          OTH
20          MA          46          MA
21          HS          47          OTH
22          HS          48          MA
23          DOC         49          DOC
24          DOC         50          HS
25          MA          51          MA
26          MA
SOURCE: Gamst, Dana, Der-Karabetian, and Kramer (2001). NOTE: MA = master's degree; HS = high school graduate; DOC = doctorate (PhD or PsyD); BA = bachelor's degree; OTH = other.
Table 3a.2  Frequency Table Showing One Out-of-Range Value

Terminal Degree     n    Percentage
High school         9        18
Bachelor's          4         8
Master's           25        50
Doctorate          10        20
Other               2         4
6                   1         1
Total              51       100
To fit the data into a relatively small space, we have structured Table 3a.1 to display the data in two columns. A cursory inspection of the information contained in Table 3a.1 suggests a jumble of degree statuses scattered among 50 of the 51 mental health practitioners; a coherent pattern is hard to discern. A better and more readable way to display these data is the frequency distribution, or frequency table, shown in Table 3a.2. Each row of the table is reserved for a particular value of the variable called terminal degree, and it aggregates the number of cases with a given value and their percentage of representation in the data array.

A frequency table such as that illustrated in Table 3a.2 enables researchers to quickly decipher the important information contained within a distribution of values. For example, we can see that 50% of the mental health providers had master's degrees and 20% had doctorates. Thus, a simple summary statement of this table might note that 70% of the mental health providers held graduate-level degrees.

Table 3a.2 also shows how useful a frequency table can be in the data cleaning process. Because the researchers were using only code values 1 through 5, the value of 6 (coded for Respondent 16) represents an anomalous code that should not exist at all; it is for that reason that the value of 6 has no label in the table. In all likelihood, this represents a data entry error. To discover which case carries this anomalous value in a very large data file, one could have SPSS list the case number and the terminal degree variable for everyone (or, to make the output easier to read, one could first select the cases with codes greater than 5 on this variable and then do the listing).
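For readers working outside SPSS, here is a rough Python/pandas equivalent of building Table 3a.2 and listing the offending case. We assume, purely as an illustration and not from the original study, that the codes are 1 = HS, 2 = BA, 3 = MA, 4 = DOC, and 5 = OTH:

    import pandas as pd

    # Degree codes for the 51 providers of Table 3a.1 under the assumed
    # coding scheme; Respondent 16 erroneously carries the code 6.
    codes = [3, 3, 1, 3, 1, 4, 3, 4, 3, 3, 3, 1, 4, 3, 4, 6, 3, 1, 4, 3, 1,
             1, 4, 4, 3, 3, 1, 3, 3, 1, 3, 2, 3, 2, 3, 2, 3, 3, 4, 3, 2, 3,
             3, 4, 5, 3, 5, 3, 4, 1, 3]
    degree = pd.Series(codes, index=range(1, 52), name="degree")

    # The frequency table: one row per observed code, with n and percentage.
    freq = degree.value_counts().sort_index()
    print(pd.DataFrame({"n": freq, "Percentage": (100 * freq / 51).round(1)}))

    # List the respondents carrying codes greater than 5 -- the analogue of
    # selecting and listing those cases in SPSS (points to Respondent 16).
    print(degree[degree > 5])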
Table 3a.3  An Abbreviated Frequency Table of Global Assessment of Functioning (GAF) Scores for Asian American Community Mental Health Consumers (one row per observed GAF score value; N = 200)

n      Percentage
 4         2.0
 1         0.5
 1         0.5
 2         1.0
 3         1.5
 1         0.5
 3         1.5
 4         2.0
 5         2.5
 2         1.0
 6         3.0
 5         2.5
 7         3.5
 6         3.0
 8         4.0
10         5.0
 9         4.5
 5         2.5
10         5.0
 8         4.0
 9         4.5
13         6.5
12         6.0
12         6.0
11         5.5
 8         4.0
 7         3.5
 8         4.0
 5         2.5
 3         1.5
 4         2.0
 3         1.5
 3         1.5
 1         0.5
 1         0.5
Total: 200   100.0
Figure 3a.1  Histogram Showing the Frequency Count of the GAF Scores (x-axis: gafs, 0 to 80; y-axis: Frequency, 0 to 30)
Some researchers are comfortable with a conservative threshold of 0.5 as indicative of departures from normality (e.g., Hair et al., 1998; Runyon, Coleman, & Pittenger, 2000), whereas others prefer a more liberal criterion of 1.00 for skewness, kurtosis, or both (e.g., George & Mallery, 2003; Morgan, Griego, & Gloeckner, 2001). Tabachnick and Fidell (2001a) suggest a more definitive assessment strategy for detecting normality violations: divide a skewness or kurtosis value by its respective standard error and evaluate this coefficient against a standard normal table of values (z scores). But such an approach has its own pitfalls, as Tabachnick and Fidell (2001a) note: "But if the sample is large, it is better to inspect the shape of the distribution instead of using formal inference because the equations for standard error of both skewness and kurtosis contain N, and normality is likely to be rejected with large samples even when the deviation is slight" (p. 44).
Another helpful heuristic, at least regarding skewness, comes from the SPSS help menu, which suggests that any skewness value more than twice its standard error indicates a departure from symmetry. Unfortunately, no comparable heuristic is provided for determining normality violations due to extreme kurtosis. The shape of the distribution becomes of interest when researchers are evaluating their data against the assumptions of the statistical procedures they are considering using.
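These rules of thumb are easy to compute outside SPSS as well. In the following scipy sketch the GAF scores are simulated, and the standard errors are the familiar large-sample approximations (the square roots of 6/N and 24/N), not SPSS's exact formulas:

    import numpy as np
    from scipy import stats

    x = np.random.default_rng(1).normal(55, 10, 200)   # hypothetical GAF scores

    n = len(x)
    skew = stats.skew(x)
    kurt = stats.kurtosis(x)            # excess kurtosis (normal distribution = 0)
    se_skew = np.sqrt(6.0 / n)          # approximate standard errors
    se_kurt = np.sqrt(24.0 / n)

    print("z_skew =", skew / se_skew)   # |z| > ~2 suggests asymmetry, but recall
    print("z_kurt =", kurt / se_kurt)   # the large-N oversensitivity noted above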
Stem-and-Leaf Plots
Figure 3a.2 provides the next of kin to the histogram, a display called a stem-and-leaf plot. This type of display, introduced by the statistician John Tukey (1977) in Exploratory Data Analysis, here represents hypothetical GAF scores that might have been found for a sample of individuals arriving for their first session of mental health counseling. Stem-and-leaf plots provide information about the frequency of a quantitative variable's values while incorporating the actual values of the distribution. These plots are composed of three main components. On the far left side of Figure 3a.2 is the frequency with which a particular value (the one shown for that row) occurred. In the center of the figure are the stems, the leading digits of the scores (here, the tens digit of each GAF score); to the right of each stem are its leaves, one digit per case.
Figure 3a.2  Stem-and-Leaf Plot of the Hypothetical GAF Scores (row frequencies: 1, 3, 3, 7, 9, 20, 26, 29, 28, 28, 10, 9, 2, 3; stem width: 10; each leaf: 1 case)
Box Plots
Box plots, or box-and-whiskers plots, were also introduced by Tukey (1977) to help researchers identify extreme scores. Extreme scores can adversely affect many of the statistics one would ordinarily compute in the course of performing routine statistical analyses. For example, the mean of 2, 4, 5, 6, and 64 is 16.2. The presence of the extreme score of 64 has produced a measure of central tendency that does not really represent the majority of the scores; in this case, the median value of 5 is a more representative index of the central tendency of this small distribution. As we will discuss at length later, extreme scores, known as outliers, are often removed or somehow replaced by more acceptable values. Box plots convey a considerable amount of information about a distribution in one fairly condensed display, and it is well worth mastering the terminology associated with box plots so that they become a part of your data screening arsenal. An excellent description of this topic is provided by Cohen (1996), and what we present here draws heavily from his treatment.
Figure 3a.3  The General Form of a Box and Whiskers Plot Based on Cohen's (1996) Description (labeled components, from top to bottom: outliers; upper inner fence; upper adjacent value; upper whisker; 75th percentile [third quartile], also called the upper Tukey's hinge; median; 25th percentile [first quartile], also called the lower Tukey's hinge; outliers)
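The landmarks in Figure 3a.3 can be computed directly. In this numpy sketch, ordinary percentiles stand in for Tukey's hinges (the two can differ slightly in small samples), applied to the small five-score distribution from above:

    import numpy as np

    x = np.array([2, 4, 5, 6, 64], dtype=float)   # small hypothetical sample

    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences

    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    lower_adjacent, upper_adjacent = inside.min(), inside.max() # whisker endpoints
    outliers = x[(x < lower_fence) | (x > upper_fence)]         # here: 64
    print(q1, median, q3, lower_fence, upper_fence, outliers)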
Figure 3a.4  Box Plot of the Hypothetical GAF (hgaf) Scores (y-axis: 10 to 80)

Scatterplot Matrices

We have been discussing various data cleaning and data screening devices that are used to assess one variable at a time; that is, we have assessed variables in a univariate manner. Because of the complex nature of the statistical analyses we will be employing throughout this book, it is also incumbent on us to screen variables in a bivariate and multivariate manner, that is, to examine the interrelationship of two or more variables for unusual patterns of variability in combination with each other. For example, we can ask about the relationship between GAF intake and GAF termination or between years lived in the United States and GAF intake. These sorts of questions can be routinely addressed with a scatterplot matrix of continuous variables.

We show an example of a scatterplot matrix in Figure 3a.5. In this case, we used four variables and obtained scatterplots for each combination; for ease of viewing, we present only the upper half of the matrix. Each entry represents the scatterplot of two variables. For example, the left-most plot in the first row shows the relationship of Variables A and B. It would appear from the plot that they might be related in a curvilinear rather than a linear manner. On the other hand, B and C seem to be related linearly. As we will see later in this chapter, these plots are often used to look for violations of the multivariate assumptions of normality and linearity. An alternative approach to addressing linearity with SPSS, which we will not cover here, is to use the regression curve estimation procedure.
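A scatterplot matrix akin to Figure 3a.5 can also be produced outside SPSS. This sketch fabricates four variables with the kinds of relationships just described:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    df = pd.DataFrame({"A": rng.normal(size=100)})
    df["B"] = df["A"] ** 2 + rng.normal(scale=0.3, size=100)   # curvilinear with A
    df["C"] = df["B"] + rng.normal(scale=0.3, size=100)        # linear with B
    df["D"] = rng.normal(size=100)                             # unrelated

    pd.plotting.scatter_matrix(df, figsize=(6, 6))   # one panel per variable pair
    plt.show()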
Figure 3a.5  Scatterplot Matrix of Variables A, B, C, and D (upper half of the matrix shown; each panel plots one pair of variables)
NOTE: GAF T1 = GAF at intake; GAF T2 = GAF at Time 2; Sessions = number of treatment sessions.
Each case in Table 3a.4 should have contributed four data points. For example, the third case was missing a second GAF score and a record of the number of therapy sessions he or she attended; these two missing values thus made up 50% of the four data points that should have been there. A quick inspection of Table 3a.4 indicates that missing data are scattered across all four of the variables under study. We can also note that 9 of the 15 cases (60%) have no missing data. Because almost all the multivariate procedures we talk about in this book need to be run on a complete data set, the defaults set by the statistical program would select only these 9 cases with full data, excluding all cases with a missing value on one or more variables. This issue of sample size reduction as a function of missing data, especially when the reduction appears to be nonrandom, can threaten the external validity of the research.

Further inspection of Table 3a.4 shows that one third of the sample (5 respondents) had missing data on the GAF-T2 (the 6-month postintake measurement) variable. Normally, such a relatively large proportion of nonresponse for one variable would nominate it as a possible candidate for deletion (by not including that variable in the analysis, more cases would have a complete data set and thus contribute to the analysis). Nonresponse to a postmeasure on a longitudinal (multiply measured) variable is a fairly common occurrence. If this variable is crucial to our analyses, then some form of item imputation (i.e., estimation of what the missing value might have been and replacement of the missing value with that estimate) may be in order.

We should also be attentive to how much missing data are associated with each measure. As a general rule, variables containing missing data on 5% or fewer of the cases can be ignored (Tabachnick & Fidell, 2001b). The second GAF variable, with a third of its values missing, has a missing value rate substantially greater than this rule of thumb and therefore needs a closer look. In the present example, although the remaining three variables exceed this 5% mark as well, their relative frequency of cases with missing data is small enough to ignore. An important consideration at this juncture, then, is the randomness of the missing data pattern for the GAF-T2 variable. A closer inspection of the missing values for GAF-T2 indicates that they tend to occur among the older respondents. This is probably an indication of a systematic pattern of nonresponse and thus is probably not ignorable.

Finally, we can see from Table 3a.4 that Case 11 is missing data on three of the four variables. Missing such a large proportion of data points makes a strong argument for that case to be deleted from the data analysis.
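The kind of audit just performed on Table 3a.4 can be automated. This Python/pandas sketch builds a stand-in data set (the values are fabricated; only the missing-data pattern mimics the discussion above) and tallies missingness by variable and by case:

    import numpy as np
    import pandas as pd

    # A hypothetical stand-in for Table 3a.4: 15 cases, four variables.
    rng = np.random.default_rng(4)
    df = pd.DataFrame({"age": rng.integers(18, 70, 15).astype(float),
                       "gaf_t1": rng.normal(45, 8, 15),
                       "gaf_t2": rng.normal(55, 8, 15),
                       "sessions": rng.integers(1, 20, 15).astype(float)})
    df.loc[[2, 5, 8, 11, 13], "gaf_t2"] = np.nan   # 5 of 15 missing (33%)
    df.loc[11, ["age", "sessions"]] = np.nan       # case 11 missing 3 of 4 values

    print(df.isna().mean().round(2))       # proportion missing per variable (> .05?)
    print(df.isna().mean(axis=1).round(2)) # proportion missing per case
    print(len(df.dropna()), "of", len(df), "cases are complete")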
Listwise Deletion
This method involves deleting from the particular statistical analysis all cases that have missing data. We call this method listwise because a case is dropped from the analysis if it has a missing value on any variable in the list of variables being analyzed.
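Continuing the hypothetical data frame from the sketch above, listwise deletion amounts to a single pandas call:

    # Listwise deletion: any case with a missing value on any variable is
    # dropped entirely -- the SPSS default in most multivariate procedures.
    complete = df.dropna()
    print(f"{len(complete)} of {len(df)} cases retained for analysis")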
Pairwise Deletion
This approach computes summary statistics (e.g., means, standard deviations, correlations) from all available cases that have valid values (it is the SPSS default method of handling missing values for these computations in most of the procedures designed to produce descriptive statistics). Thus, no cases are necessarily completely excluded from the data analysis. Cases with missing values on certain variables would still be included when other variables (on which they had valid values) were involved in the analysis. If we wanted to compute the mean for variables X, Y, and Z using, for example, the Frequencies procedure in SPSS, all cases with valid values on X would be brought into the calculation of X's mean, all cases with valid values on Y would be used for the calculation of Y's mean, and all cases with valid values on Z would be used for the calculation of Z's mean. It is therefore possible that the three means could very well be based on somewhat different cases and somewhat different Ns.
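With the same hypothetical data frame, pandas behaves analogously to the SPSS default: each descriptive statistic is computed from whatever valid values its own variable has.

    # Pairwise-style descriptive statistics: NaNs are skipped column by column,
    # so the Ns (and the cases behind each mean) can differ across variables.
    print(df.mean())
    print(df.count())   # the possibly different Ns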
Correlation presents another instance where pairwise deletion is the default. To be included in computing the correlation of X and Y, cases must have valid values on both variables. Assume that Case 73 is missing a value on the Y variable. That case is therefore excluded in the calculation of the correlations between X and Y and between Y and Z. But that case will be included in computing the correlation between X and Z if it has valid values on those variables. It is therefore not unusual for correlations produced by the Correlations procedure in SPSS to be based on somewhat different cases. Note that when correlations are computed in one of the multivariate procedures where listwise deletion is in effect, that method rather than pairwise deletion is used, so that the correlations in the resulting correlation matrix are based on exactly the same cases. Although pairwise deletion can be successfully used with linear regression, factor analysis, and structural equation modeling (see Allison, 2002), this method is clearly most reliable when the data are missing completely at random (MCAR). Furthermore, statistical software algorithms for computing standard errors with pairwise deletion show a considerable degree of variability and even bias (Allison, 2002). Our recommendation is not to use pairwise deletion when conducting multiple regression, factor analysis, or structural equation modeling.
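The contrast between the two deletion methods is easy to see with the same hypothetical data frame; note that pandas' .corr(), like the SPSS Correlations procedure, is pairwise by default:

    # Pairwise versus listwise correlation matrices for the same data.
    print(df.corr())           # pairwise: each r uses all cases valid on that pair
    print(df.dropna().corr())  # listwise: every r based on exactly the same cases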
Imputation Procedures
The next approaches to missing data that we describe are collectively referred to as imputation procedures. These methods attempt to impute or substitute for a missing value some other value that they deem to be a reasonable guess or estimate. The statistical analysis is then conducted using these imputed values. Although these imputation methods do preserve sample size, we urge caution in the use of these somewhat intoxicating remedies for missing data situations. Permanently altering the raw data in a data set can have potentially catastrophic consequences for the beginning and experienced multivariate researcher alike. For good overviews of these procedures, see Allison (2002), Hair et al. (1998), and Tabachnick and Fidell (2001b).
Mean Substitution
Mean substitution calls for replacing all missing values of a variable with the mean of that variable. The mean to be used as the replacement value, of course, must be based on all the valid cases in the data file. This is both the most common and most conservative of the imputation practices.
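As a sketch, mean substitution on the hypothetical GAF-T2 variable from the earlier data frame is one line of pandas:

    # Replace every missing GAF-T2 value with the mean of the valid GAF-T2
    # scores (the mean is computed from all valid cases, as described above).
    df["gaf_t2"] = df["gaf_t2"].fillna(df["gaf_t2"].mean())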
Beyond EM Imputation
A more recently developed approach to estimating missing values is the multiple imputation (MI) procedure (see Allison, 2002; Graham et al., 2003; Schafer & Graham, 2002; West, 2005). At the time we are writing this book, the MI procedure has been implemented in SAS but not yet in SPSS. MI continues where EM stops. Very briefly stated, EM generates an estimate of the missing value. Recognizing that this is not a precise estimate but, rather, has some estimation error associated with it, MI adds to or subtracts from the EM estimate the value of a randomly chosen residual from the regression analysis, thus building some error variance into the estimated value. MI cycles through this prediction and residual selection process in an iterative manner until convergence on the estimated values of the model parameters is reached. This whole process is usually performed between 5 and 10 times (West, 2005), generating many separate parameter estimates. The final values of the estimated parameters of interest are then computed from the information across these various estimates. Because we are attempting to best predict the missing values, this approach allows us to include in the effort variables that are not necessarily related to the research question but that might enhance our prediction of the missing values themselves, provided that such variables were thoughtfully included in the original data collection process.
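To make the logic concrete, here is a deliberately simplified numpy sketch of the MI idea: regression prediction plus a randomly drawn residual, repeated several times, with the estimates pooled at the end. A production MI routine (e.g., in SAS) is considerably more sophisticated than this.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(50, 10, 100)
    y = 0.8 * x + rng.normal(0, 5, 100)
    y[rng.choice(100, 15, replace=False)] = np.nan   # 15 values set to missing

    obs = ~np.isnan(y)
    b1, b0 = np.polyfit(x[obs], y[obs], 1)           # regression on observed cases
    residuals = y[obs] - (b0 + b1 * x[obs])

    estimates = []
    for _ in range(10):                              # m = 10 imputed data sets
        y_imp = y.copy()
        noise = rng.choice(residuals, size=(~obs).sum())  # randomly chosen residuals
        y_imp[~obs] = b0 + b1 * x[~obs] + noise      # prediction + error variance
        estimates.append(y_imp.mean())               # parameter of interest: mean of y

    print(np.mean(estimates))                        # pooled estimate across imputations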
Recommendations
We agree with both the sage and tongue-in-cheek advice of Allison (2002) that "the only really good solution to the missing data problem is not to have any" (p. 2). Because the likelihood of such an eventuality is low, we encourage you to explore your missing values situation. As a first step, you could compare cases with and without missing values on variables of interest using independent-samples t tests. For example, cases with missing gender data could be coded as 1 and cases with complete gender data could be coded as 0. You could then check whether any statistically significant differences emerge for this dummy-coded independent variable on a dependent variable of choice, such as respondents' GAF scores. Such an analysis may give you confidence that the missing values are or are not related to a given variable under study.

If you elect to use some form of missing value imputation, it is worthwhile to compare your statistical analysis against one using only cases with complete data (Tabachnick & Fidell, 2001b). If no differences emerge between the complete and imputed data sets, then you can have confidence that your missing value interventions reflect statistical reality. If they are different, then further exploration is in order.

We recommend the use of listwise case deletion when you have small numbers of missing values that are MCAR or MAR (missing at random). Deleting variables that contain high proportions of missing data can also be desirable if those variables are not crucial to your study. Mean substitution and regression imputation procedures can also be profitably employed when missing values are proportionately small (Tabachnick & Fidell, 2001b), but we recommend careful pre-post appraisal of the data set as outlined above, as well as consultation with a statistician as you feel necessary. If you have the SPSS Missing Values Analysis module, then we suggest using it to handle your missing values; it will do an adequate job of estimating replacement values. Finally, if you have access to an MI procedure (e.g., in SAS) and feel comfortable working with it and subsequently importing your data into SPSS to perform the statistical analyses, then we certainly recommend using that procedure.
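The dummy-coding check described above might look like this in Python/scipy; the data are fabricated for illustration:

    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(9)
    df = pd.DataFrame({"gender": rng.choice([0.0, 1.0, np.nan], 60,
                                            p=[.45, .45, .10]),
                       "gaf": rng.normal(55, 10, 60)})
    miss = df["gender"].isna()      # the dummy code: 1 = missing, 0 = complete

    # Compare GAF scores for cases with versus without missing gender data.
    t, p = stats.ttest_ind(df.loc[miss, "gaf"], df.loc[~miss, "gaf"])
    print(t, p)                     # a small p suggests nonrandom missingness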
Outliers
Cases with extreme or unusual values on a single variable (univariate) or on a combination of variables (multivariate) are called outliers. These outliers present researchers with a mixed opportunity: their existence may signal the serendipitous presence of new and exciting patterns within a data set, yet they may also signal anomalies within the data that may need to be remedied before the main analyses are performed.
Causes of Outliers
Hair et al. (1998) identify four reasons for outliers in a data set:

1. Outliers can be caused by data entry errors or improper attribute coding. These errors are normally caught in the data cleaning stage.

2. Some outliers may be a function of extraordinary events or unusual circumstances. For example, in a human memory experiment, a participant may recall all 80 of the stimulus items correctly, or illness may strike a participant during the middle of a clinical interview, changing the nature of her responses when she returns the following week to finish it. Most of the time, the safest course is to eliminate outliers produced by these circumstances, but not always. The fundamental question you should ask yourself is, "Does this outlier represent my sample?" If yes, then you should include it.

3. There are some outliers for which we have no explanation. These unexplainable outliers are good candidates for deletion.

4. There are multivariate outliers whose uniqueness lies in their pattern of combination of values on several variables, for example, unusual combined patterns of age, gender, and number of arrests.
George and Mallery (2003) provide this tip on outliers: if outliers are few (less than 1% or 2% of n) and not very extreme, they "are probably best left alone" (p. 128). An alternative approach to univariate detection of outliers involves inspecting histograms, box plots, and normal probability plots (Tabachnick & Fidell, 2001b). Univariate outliers reveal themselves through their visible separation from the bulk of the cases on a particular variable when profiled with these graphical techniques.
Normality
The shape of the distribution of each continuous variable in a multivariate analysis should correspond to a (univariate) normal distribution; that is, the variable's frequency distribution of values should roughly approximate a bell-shaped curve. Both Stevens (2002) and Tabachnick and Fidell (2001b) indicate that univariate normality violations can be assessed with statistical or graphical approaches.
Graphical Approaches
Graphical approaches that assess univariate normality typically begin with an inspection of histograms or stem-and-leaf plots for each variable. However, such cursory depictions do not provide a definitive indication of a normality violation. A more precise graphical method is to use a normal probability plot, where the values of a variable are rank ordered and plotted against expected normal distribution values (Stevens, 2002). In these plots, a normal distribution produces a straight diagonal line, and the plotted data values are compared with this diagonal. Normality is assumed if the data values follow the diagonal line.
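Outside SPSS, scipy will draw such a plot directly; the variable in this sketch is simulated:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    gaf = np.random.default_rng(3).normal(55, 10, 200)   # hypothetical scores

    # Points hugging the diagonal reference line are consistent with normality.
    stats.probplot(gaf, dist="norm", plot=plt)
    plt.show()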
Multivariate Approaches
So far we have been discussing the assumption of univariate normality. The assumption of multivariate normality, although somewhat more complicated, is intimately related to its univariate counterpart. Stevens (2002) cautions researchers that demonstrating univariate normality on each variable in a data set does not guarantee that multivariate normality (that the observations are normally distributed among all combinations of variables) is satisfied. As Stevens (2002) notes, "Although it is difficult to completely characterize multivariate normality, normality on each of the variables separately is a necessary, but not sufficient, condition for multivariate normality to hold" (p. 262). Thus, although univariate normality is an essential ingredient of multivariate normality, Stevens argues that two other conditions must also be met: (a) linear combinations of the variables (i.e., variates) should be normally distributed, and (b) all pairwise combinations of variables should also be normally distributed.

As we noted previously, SPSS offers procedures to easily examine whether univariate normality is present, with various statistical tests and graphical options, but it does not offer a statistical test of multivariate normality. We therefore recommend a thorough univariate normality examination coupled with a bivariate scatterplot examination of key pairs of variables. If the normality assumption appears to be violated, it may be possible to repair the problem through a data transformation process.
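Although our workflow stops at the univariate and bivariate checks, readers with access to Python can also compute Mardia's multivariate skewness and kurtosis statistics, a common formal check of multivariate normality that SPSS does not provide and that we do not otherwise cover. This rough sketch uses simulated data, and the test statistics follow the standard large-sample results:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 3))        # 200 cases, 3 variables (hypothetical)

    n, p = X.shape
    Z = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    D = Z @ S_inv @ Z.T                  # matrix of generalized cross-products

    b1p = (D ** 3).mean()                # Mardia's multivariate skewness
    b2p = (np.diag(D) ** 2).mean()       # Mardia's multivariate kurtosis

    chi2 = n * b1p / 6                   # ~ chi-square, df = p(p+1)(p+2)/6
    df_ = p * (p + 1) * (p + 2) / 6
    z = (b2p - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)   # ~ standard normal
    print(1 - stats.chi2.cdf(chi2, df_), 2 * (1 - stats.norm.cdf(abs(z))))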
Linearity
Many of the multivariate techniques we cover in this text (e.g., multiple regression, multivariate analysis of variance [MANOVA], factor analysis) assume that the variables in the analysis are related to each other in a linear manner; that is, they assume that the best-fitting function representing the scatterplot is a straight line. Based on this assumption, these procedures often compute the Pearson correlation coefficient (or a variant of it) as part of the calculations needed for the multivariate statistical analysis. As we will discuss in Chapter 4A, the Pearson r assesses the degree of linear relationship observed between two variables. Nonlinear relationships between two variables cannot be assessed by the Pearson correlation coefficient. To the extent that such nonlinearity is present, the observed Pearson r would be a less representative index of the strength of the association between the two variables; it would identify less relationship strength than actually existed because it could capture only the linear component of the relationship. The use of bivariate scatterplots is the most typical way of assessing linearity between two variables. Variables that are both normally distributed and linearly related to each other will produce scatterplots that are oval shaped, or elliptical. If one of the variables is not normally distributed, linearity will not be achieved. We can recognize this situation because the scatterplot of the two variables will depart from that elliptical shape, bulging or bending away from a straight-line pattern.
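A tiny numeric illustration of the point: a perfect but U-shaped relationship yields a Pearson r of essentially zero.

    import numpy as np
    from scipy import stats

    x = np.linspace(-3, 3, 100)
    y = x ** 2                        # perfectly curvilinear (U-shaped) relationship

    print(stats.pearsonr(x, y)[0])    # approximately 0: no *linear* component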
Homoscedasticity
The assumption of homoscedasticity holds that quantitative dependent variables have equal levels of variability across the range of (either continuous or categorical) independent variables (Hair et al., 1998). Violation of this assumption results in heteroscedasticity, which typically occurs when a variable is not distributed in a normal manner or when a data transformation has produced an unanticipated distribution for a variable (Tabachnick & Fidell, 2001b).

In the univariate analysis of variance (ANOVA) context (with one quantitative dependent variable and one or more categorical independent variables), this homoscedasticity assumption is referred to as homogeneity of variance: It is assumed that equal variances of the dependent measure are observed across the levels of the independent variables (Keppel, 1991; Keppel, Saufley, & Tokunaga, 1992). Several statistical tests can be used to detect homogeneity of variance violations, including Fmax and Levene's test. The Fmax statistic is computed by dividing the largest group variance by the smallest group variance. Keppel et al. (1992) note that any Fmax value of 3.0 or greater is indicative of an assumption violation, and they recommend the use of the more stringent alpha level of p < .025 when evaluating the F ratio. Alternatively, Levene's test assesses the statistical hypothesis of equal variances across the levels of the independent variable; rejection of the null hypothesis (at p < .05) indicates an assumption violation, that is, unequal variability. Stevens (2002) cautions about the use of the Fmax test because of its extreme sensitivity to violations of normality. The Fmax statistic can be produced through the ANOVA procedure, and the Levene test can be produced with the SPSS Explore and Oneway procedures.
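Both checks can be sketched in a few lines of Python: scipy supplies Levene's test, and Fmax is a simple ratio. The three groups here are simulated with deliberately unequal spread:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    g1, g2, g3 = (rng.normal(50, s, 30) for s in (5, 8, 14))

    variances = [np.var(g, ddof=1) for g in (g1, g2, g3)]
    fmax = max(variances) / min(variances)   # >= 3.0 flags a violation (Keppel et al.)
    W, p = stats.levene(g1, g2, g3)          # p < .05 indicates unequal variances
    print(fmax, W, p)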
When more than one quantitative dependent variable is being assessed (as in the case of MANOVA), Box's M test for equality of variance-covariance matrices is used to test for homoscedasticity. Akin to its univariate counterpart, Levene's test, Box's M tests the statistical hypothesis that the variance-covariance matrices are equal. A statistically significant (p < .05) Box's M test indicates a homoscedasticity assumption violation, but the test is very sensitive to any departures from normality among the variables under scrutiny (Stevens, 2002).

Typically, problems related to homoscedasticity violations can be attributed to normality violations for one or more of the variables under scrutiny. Hence, it is probably best to first assess and possibly remediate normality violations before addressing the issue of equal variances or variance-covariance matrices (Hair et al., 1998; Tabachnick & Fidell, 2001b). If heteroscedasticity is present, it too can be remedied by means of data transformations. We address this topic next.
Data Transformations
Data transformations are mathematical procedures that can be used to modify variables that violate the statistical assumptions of normality, linearity, and homoscedasticity or that have unusual outlier patterns (Hair et al., 1998; Tabachnick & Fidell, 2001b). First, you determine the extent to which one or more of these assumptions is violated. Then you decide whether the situation calls for a data transformation to correct the matter. If so, you instruct SPSS to change every value of the variable or variables you wish to transform, and you then perform the statistical analysis on these changed, or transformed, data values. Much of our current understanding of data transformations has been informed by the earlier seminal work of Box and Cox (1964) and Mosteller and Tukey (1977). These data transformations can be easily accomplished with SPSS through its Compute procedure (see Chapter 3B).

A note of caution should be expressed here: Data transformations are somewhat of a double-edged sword. On the one hand, their use can significantly improve the precision of a multivariate analysis. At the same time, however, using a transformation can pose a formidable data interpretation problem. For example, a logarithmic transformation of a mental health consumer's GAF score or number of mental health service sessions will produce numbers quite different from the ordinary raw values we are used to seeing and may therefore pose quite a challenge for the average journal reader to interpret properly (Tabachnick & Fidell, 2001b). Because of this apparent trade-off, transformations should be applied judiciously.
Table 3a.5  Comparison of Common Data Transformations With Hypothetical GAF Scores

GAF Score   Square Root   Reflect & Square Root   Reflect & Logarithm   Reflect & Inverse
  1             1.00             10.00                   2.00                 .01
  5             2.24              9.80                   1.98                 .01
 25             5.00              8.72                   1.88                 .01
 50             7.07              7.14                   1.71                 .02
100            10.00              1.00                   0.00                1.00
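The table's values can be reproduced with numpy; the reflection constant of 101 (the largest score plus 1) is implied by the table itself:

    import numpy as np

    gaf = np.array([1.0, 5.0, 25.0, 50.0, 100.0])
    reflected = 101 - gaf                 # reflect: (largest score + 1) - score

    print(np.sqrt(gaf).round(2))          # square root: 1.00, 2.24, 5.00, 7.07, 10.00
    print(np.sqrt(reflected).round(2))    # reflect & square root
    print(np.log10(reflected).round(2))   # reflect & logarithm
    print((1 / reflected).round(2))       # reflect & inverse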
Typically, a progression of transformation strategies is employed depending on the perceived severity of the statistical assumption violation. For example, Tabachnick and Fidell (2001b) and Mertler and Vannatta (2001) lobby for a data transformation progression from square root (to correct a moderate violation), to logarithm (for a more substantial violation), and then to inverse (to handle a severe violation). In addition, arcsine transformations can be profitably employed with proportional data, and squaring one variable in a nonlinear bivariate relationship can effectively alleviate a nonlinearity problem (Hair et al., 1998).
Recommended Readings
Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage.
Barnett, V., & Lewis, T. (1978). Outliers in statistical data. New York: Wiley.
Berry, W. D. (1993). Understanding regression assumptions. Newbury Park, CA: Sage.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, 26(Series B), 211-243.
Duncan, T. E., Duncan, S. C., & Li, F. (1998). A comparison of model- and multiple imputation-based approaches to longitudinal analyses with partial missingness. Structural Equation Modeling, 5, 1-21.
Enders, C. K. (2001). The impact of nonnormality on full information maximum-likelihood estimation for structural equation models with missing data. Psychological Methods, 6, 352-370.
Enders, C. K. (2001). A primer on maximum likelihood algorithms available for use with missing data. Structural Equation Modeling, 8, 128-141.
Fox, J. (1991). Regression diagnostics. Thousand Oaks, CA: Sage.
Gold, M. S., & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization. Structural Equation Modeling, 7, 319-355.
Roth, P. L. (1994). Missing data: A conceptual review from applied psychologists. Personnel Psychology, 47, 537-560.
Rousseeuw, P. J., & van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633-639.
Rubin, D. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
Stevens, J. P. (1984). Outliers and influential data points in regression analysis. Psychological Bulletin, 95, 334-344.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.