
Educational and Psychological Measurement
http://epm.sagepub.com/

Reliability of Scores from the Eysenck Personality Questionnaire: A Reliability Generalization Study
John C. Caruso, Katie Witkiewitz, Annie Belcourt-Dittloff and Jennifer D. Gottlieb
Educational and Psychological Measurement 2001 61: 675
DOI: 10.1177/00131640121971437
The online version of this article can be found at: http://epm.sagepub.com/content/61/4/675

Published by: http://www.sagepublications.com


Version of Record: August 1, 2001

Downloaded from epm.sagepub.com at University of Bucharest on March 6, 2014


RELIABILITY OF SCORES FROM THE EYSENCK PERSONALITY QUESTIONNAIRE: A RELIABILITY GENERALIZATION STUDY
JOHN C. CARUSO, KATIE WITKIEWITZ, ANNIE BELCOURT-DITTLOFF, AND JENNIFER D. GOTTLIEB University of Montana

A reliability generalization study was conducted on data from 69 samples found in 44 studies that employed the Psychoticism (P), Extraversion (E), Neuroticism (N), and Lie (L) scales of the Eysenck Personality Questionnaire (EPQ) or EPQ-Revised. The reliability of the scores varied considerably between scales, with P scores tending to have the lowest reliability. Hierarchical regression analyses revealed that a larger standard deviation of scores was associated with higher score reliability for all four EPQ scales. More variability in age was associated with higher score reliability for the P scale and the L scale. Samples composed of students provided scores with higher reliability than those composed of other types of individuals for the P scale. Several other potential predictors (form, language of administration, average score, average age, gender composition, and number of items per scale) were not significantly related to score reliability.

Researchers performing meta-analytic reliability generalization (RG) studies attempt to characterize the reliability of scores on a particular psychological test and to investigate the factors that influence score reliability. Briefly, the methodology consists of collecting score reliability coefficients and other information from existing studies and using various characteristics of each sample or study (such as age or gender composition) to predict score reliability. Such studies are executed to empirically examine the belief that it is not a test per se that has greater or lesser reliability but a particular set of scores derived from the administration of the test to a particular group. Wilkinson and the American Psychological Association (APA) Task Force on Statistical Inference (1999) stated that "it is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees" (p. 596).
Educational and Psychological Measurement, Vol. 61 No. 4, August 2001, 675-689. © 2001 Sage Publications




The RGs that have been conducted have usually found that the reliability of scores does, in fact, vary as a function of participant and study characteristics (e.g., Caruso, 2000; Caruso & Edwards, in press; Vacha-Haase, 1998; Yin & Fan, 2000; but see Viswesvaran & Ones, 2000), supporting the notion that reliability is a property of scores and not of tests. Based on this notion, the submission guidelines for empirical manuscripts submitted to this journal require the reporting of complete information on the reliability of scores when feasible and proscribe phrasing such as "the test is reliable" (Thompson, 1994).

RG studies typically use some form of the general linear model (e.g., regression, ANOVA, or canonical correlation) to examine the relationships between various study characteristics and score reliability. Score reliability coefficients, or some transformation of them (e.g., the standard error of measurement), are employed as criterion variables. Two well-known assumptions of general linear techniques are that the criterion variable(s) be intervally scaled and normally distributed. With regard to the former, classical test theory (e.g., Lord & Novick, 1968) leads to two seemingly contradictory interpretations of score reliability. First, the reliability coefficient is the correlation between parallel observed measurements (X and X') of a given construct. However, it is also equal to the squared correlation between either observed measurement and the true score (Xt). Thus, score reliability coefficients can be reasonably interpreted as either correlations or squared correlations. The importance of this point is that squared correlations are variance-accounted-for statistics and as such are scaled at an interval level, whereas correlations themselves are not.
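The coefficients collected in studies of this kind are typically internal consistency estimates; for dichotomous (yes/no) items such as the EPQ's, Cronbach's coefficient alpha is equivalent to KR-20. A minimal sketch of the computation follows; the function name and response data are invented for illustration and are not from the study:

```python
from typing import Sequence

def cronbach_alpha(items: Sequence[Sequence[int]]) -> float:
    """Coefficient alpha for an items-by-respondents layout.

    items[i][j] is respondent j's score on item i (0/1 for yes/no items).
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(items)            # number of items
    n = len(items[0])         # number of respondents

    def var(xs):              # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(var(item) for item in items)
    totals = [sum(items[i][j] for i in range(k)) for j in range(n)]
    return k / (k - 1) * (1 - item_var_sum / var(totals))

# Illustrative data: 3 dichotomous items, 4 respondents.
responses = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))  # 0.75
```

Because alpha depends on the ratio of item variance to total-score variance, it is a property of a particular set of scores, which is the premise of RG.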
In a very informative exchange on the value and implementation of RG studies in a recent special issue of this journal, Sawilowsky (2000) chose to interpret reliability coefficients as correlations, whereas Thompson and Vacha-Haase (2000) preferred the squared correlation interpretation. If we adhere to the interpretation of Thompson and Vacha-Haase, we need not transform the score reliability coefficients prior to analysis to satisfy the interval-level requirement, but if we interpret them as Sawilowsky did, then they should be transformed in some way to more closely approximate the interval-level assumption. Although either position seems reasonable given the dual interpretation of reliability coefficients, the interval-level property is difficult to test (cf. Cliff, 1992). Normality, on the other hand, can be examined by computing the skewness and kurtosis of the distributions of reliability coefficients and of various transformations of them. In addition to untransformed score reliability coefficients, we will also consider squared score reliability coefficients (due to the issues raised by Sawilowsky, 2000) and Fisher's z transformation, which has been shown to adjust for the skewness of the distribution of correlation coefficients (Dunlap, Silver, & Phelps, 1988; Silver & Dunlap, 1987). The precision of these measures of nonnormality is indicated by their standard errors.

The EPQ

The original EPQ was the result of successive improvements on and additions to the Maudsley Personality Inventory (MPI) (H. J. Eysenck & Knapp, 1962) and the Eysenck Personality Inventory (EPI) (H. J. Eysenck & Eysenck, 1964). The MPI was designed to measure two personality characteristics: extraversion (E) and neuroticism (N). High scorers on the E scale are characterized as sociable, excitement-seeking, pleasure-seeking, carefree, and aggressive. Low scorers are more withdrawn, serious, and moralistic and tend to enjoy being alone. An individual who scores high on the N scale is more likely to be a worried and moody person. People with high N scores also tend to suffer from emotional and psychosomatic disorders. Someone with a low N score can often be characterized as stable, less emotional, and not very anxious. The two scales of the MPI were found to be slightly intercorrelated, although they measured theoretically distinct constructs, and they often produced scores with low reliability (H. J. Eysenck & Eysenck, 1994). The EPI was developed in response to these criticisms and also included the Lie (L) scale for assessing response bias. H. J. Eysenck and Eysenck (1975) then developed the EPQ, which incorporated the Psychoticism (P) scale for assessing psychotic personality characteristics. The P scale was designed to measure behavior patterns that might be considered schizoid or psychopathic in the extreme case. An individual with a high score on the P scale may be inclined to exhibit conduct or other behavioral disorders and may lack empathy. In addition, these individuals may be hostile, insensitive, or disengaged from society. Although various researchers occasionally exclude items or employ short forms, the original versions of the full scales include the following numbers of items: P (25 items), E (21 items), N (23 items), and L (21 items).
Despite the widespread use of the questionnaire, several studies have reported that EPQ scores may have undesirable psychometric properties (e.g., Block, 1977; Goh, King, & King, 1982; Helmes, 1980). These studies have reported problems with the factor structure and low reliability of the scores, particularly on the P scale. S.B.G. Eysenck, Eysenck, and Barrett (1985) recognized three major problems with scores on the original P scale: low reliability, low range, and highly skewed distributions. Primarily to remedy the psychometric weaknesses of scores on the P scale, S.B.G. Eysenck et al. (1985) developed a revised version of the EPQ (the EPQ-R). The 94-item EPQ-R includes 27 items on the P scale, 22 items on the E scale, 24 items on the N scale, and 21 items on the L scale. The internal consistency of the scores



in the standardization sample, reported in the EPQ-R manual, ranged from .66 (P scale, male respondents) to .86 (N scale, male and female respondents). The test's authors (H. J. Eysenck & Eysenck, 1994) justify the low reliability of scores on the P scale by stating:
It must be remembered that the P scale taps several different facets (hostility, cruelty, lack of empathy, nonconformism, etc.) which may hold reliabilities lower than would be true of a scale like E, which comprises largely sociability and activity items only. (p. 14)

But the low score reliability nevertheless casts doubt on the meaningful interpretation of the scores. To the extent that the items of the P scale are not unidimensional, it may be the case that two or more subscales would allow for a more meaningful examination of individual differences. In addition, the statement of H. J. Eysenck and Eysenck (1994) implies that low reliability is a property of the P scale and that high reliability is a property of the E scale. Using the methodology of RG, we can begin to elucidate the group or study characteristics that may be related to the lower reliability of scores on the P scale, as opposed to attributing low reliability to the P scale categorically and with finality. Furthermore, we will be able to ascertain whether the reliability of scores on the P scale of the EPQ-R is typically greater than that of the EPQ, that is, whether S.B.G. Eysenck et al. (1985) achieved that goal in their revision of the scale.

Purposes

The present study has three primary purposes. First, we will assess the typical reliability of scores on the P, E, N, and L scales of the EPQ and EPQ-R. Second, we will compare the distributions of score reliability coefficients, and various transformations of them, to examine the appropriateness of parametric statistical analyses. Third, we will examine the relationships between various study and respondent characteristics and score reliability.

Method
Data

In December 1999, the American Psychological Association's (1992) PsycINFO database was used to generate a list of empirical journal articles in which the EPQ or EPQ-R was used. At that time, PsycINFO covered 1,471 periodicals from psychology and related fields. Only articles appearing between 1980 and 1999 were selected. The literature search identified 1,540 empirical journal articles in which Eysenck Personality Questionnaire, EPQ, or EPQ-R appeared as an index term, in the title, or in the abstract.



Of the 1,540 articles, most were excluded from this study. Three hundred thirteen articles (20%) were published in a language other than English. Seven hundred sixty-seven (50%), a disappointingly high number, did not mention reliability or score reliability at all. Two hundred forty-nine (16%) asserted that the EPQ (or EPQ-R) was a reliable instrument or produced reliable scores but provided no data to support this claim. One hundred thirty-five (9%) reported reliabilities from one of the EPQ manuals or from other data not collected for that study. The pattern of not even mentioning reliability is common but certainly disturbing and may originate from endemic misconceptions that tests per se are reliable (Vacha-Haase, Ness, Nilsson, & Reetz, 1999; Whittington, 1998). The pattern of inducting reliability coefficients from prior studies is also common, often unjustified, and disturbing as well (Vacha-Haase, Kogan, & Thompson, 2000). Twenty-four (2%) provided reliabilities from the data at hand but in poor form, such as the range of reliability across all scales. Of the remaining 52 studies, 8 reported test-retest reliability estimates, and these were excluded. This left 44 studies presenting usable internal consistency coefficients. These studies are marked with asterisks in the References section, although some are cited elsewhere as well. Data from 69 samples were extracted from the 44 studies. Four of these studies did not employ the L scale, and so the numbers of samples for the analyses presented here are 69 for P, E, and N and 65 for L.

Procedure

Separate analyses were conducted for P, E, N, and L score reliability. We selected multiple regression as our method for examining the relationship between score reliability and the selected predictor variables. We performed a hierarchical analysis with the number of items administered and the standard deviation of scores entered as predictors in the first block.
These variables were entered first because, under a few common assumptions, they are both algebraically related to score reliability. First, the Spearman-Brown prophecy formula gives the relationship between the number of items on a particular scale and the reliability of the scores it produces:

r*XX = k rXX / [1 + (k - 1) rXX]   (1)

where rXX is the reliability of the original scores, k is the ratio of the number of items on the new test to the number of items on the original test, and r*XX is the predicted reliability of scores on the new test. For example, if a test with 20 items produces scores with a reliability of .70 and 20 additional items are added, then k = 2 and r*XX = 2(.70)/[1 + (1)(.70)] = .82.
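The prophecy formula's worked example is easy to check with a short sketch (the function name is illustrative):

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted score reliability when test length is multiplied by k.

    r is the reliability of the original scores; k is the ratio of new
    test length to original test length.
    """
    return k * r / (1 + (k - 1) * r)

# Doubling a 20-item test whose scores have reliability .70:
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

The same function with k < 1 predicts the reliability loss from shortening a scale, which is relevant when researchers administer short forms of the EPQ.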



Second, observed score variability is related to score reliability by the following equalities:

rXX = 1 - (s2e / s2X) = s2t / s2X   (2)

where s2X is the observed score variance, s2t is the true score variance, and s2e is the error variance; the error and true score variances sum to the observed score variance. Nunnally (1970, p. 556) suggested that the first equality in Equation 2 could be used to estimate what score reliability would be if observed score variance were larger or smaller than in a given population. For example, if the reliability of a set of scores was .50, with an error variance of 2 and an observed score variance of 4 (.50 = 1 - 2/4), then the estimated score reliability in a more heterogeneous population with an observed score variance of 8 would be 1 - 2/8 = .75. Using Equation 2 in this way assumes that the error variance is the same in the two populations or, equivalently, that true score variance has increased by exactly the amount that observed score variance has. Because of these algebraic relationships, we assigned priority to the number of items administered and the standard deviation of scores and consequently entered them in the first block of our regression analyses. The other predictor variables, entered simultaneously in a second block and selected largely on the basis of availability (this is an archival study), were the mean score, the mean age of participants, the standard deviation of age, sample type (0 = student, 1 = nonstudent), gender composition (coded as the proportion of participants who were male), language of administration (0 = English, 1 = non-English), and EPQ form (0 = EPQ, 1 = EPQ-R). Table 1 provides descriptive statistics for the predictor variables.
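Nunnally's use of Equation 2 can be expressed directly in code; the constant-error-variance assumption is explicit in the function body (the function name is illustrative):

```python
def adjusted_reliability(r: float, var_old: float, var_new: float) -> float:
    """Estimate score reliability in a population with a different observed variance.

    Assumes the error variance is unchanged across populations:
    s2e = (1 - r) * var_old, so r' = 1 - s2e / var_new.
    """
    error_var = (1 - r) * var_old
    return 1 - error_var / var_new

# Reliability .50 with observed variance 4; a more heterogeneous
# population with observed variance 8:
print(adjusted_reliability(0.50, 4, 8))  # 0.75
```

Note that the adjustment runs both ways: restricting the range of scores (var_new < var_old) lowers the estimated reliability, which is the mechanism behind the standard-deviation predictor in the regression analyses.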

Results
The first goal of this study was to characterize the reliability of scores on each EPQ scale in terms of central tendency and variability. Table 2 presents the median, mean, standard deviation, and range of score reliabilities for each of the four EPQ scales. As shown, scores on the N and E scales tend to be most reliable, with medians of .83 and .82, respectively. Scores on the P and L scales were less reliable, with medians of .68 and .78, respectively. Scores on the P scale in particular often had poor reliability, with a minimum of .36 and an interquartile range from .55 to .77.

The second goal of this study was to compare the distributions of the score reliability coefficients for each scale to the distributions that resulted after two transformations: squaring and Fisher's z transformation. The skewness and kurtosis of each type of coefficient for each scale were computed, along with their standard errors, and these are provided in Table 3. The distributions



Table 1
Descriptive Statistics for Predictor Variables

Predictor                         M       SD     Range
Average age                     27.89    9.73    16.51-63.50
Standard deviation of age        7.45    4.84     0.67-18.60
Proportion male                  0.50    0.40     0.00-1.00
Number of items
  Psychoticism                  25.72    6.38     6-32
  Extraversion                  20.66    3.82     6-25
  Neuroticism                   22.06    4.18     6-25
  Lie                           20.08    3.36     6-23
Mean of scores
  Psychoticism                   4.96    2.48     0.90-11.53
  Extraversion                  12.32    3.12     2.80-20.05
  Neuroticism                   10.12    3.39     2.10-17.44
  Lie                            8.63    3.02     2.30-16.55
Standard deviation of scores
  Psychoticism                   3.12    1.14     1.20-5.83
  Extraversion                   4.26    1.01     1.50-7.35
  Neuroticism                    4.46    1.09     1.10-6.18
  Lie                            3.90    0.94     1.50-8.20

Predictor                        n     Percentage
Language of administration
  English                       49        71
  Non-English                   17        25
  Missing                        3         4
EPQ form
  Original                      38        55
  Revised                       31        45
Sample type
  Student                       31        45
  Not student                   38        55

Table 2
Descriptive Statistics for Score Reliability Coefficients

EPQ Scale       Minimum   Maximum   Median    M      SD
Psychoticism      .36       .91       .68    .66    .13
Extraversion      .68       .93       .82    .82    .05
Neuroticism       .69       .97       .83    .83    .04
Lie               .59       .88       .78    .77    .05

of the three types of coefficients were generally not highly skewed, and, except for the N scale, they were not highly kurtotic. The untransformed score reliability coefficients for the L scale had a statistically significant amount of negative skew, and the Fisher's z transformations for the N scale had a significant amount of positive skew. Based on this preliminary evidence, it



Table 3
Nonnormality of the Three Operationalizations of Score Reliability Coefficients

                         Skewness                                Kurtosis
EPQ Scale       r_xx        r_xx2       z_xx         r_xx        r_xx2        z_xx
Psychoticism   0.33 (.29)  0.03 (.29)  0.32 (.29)   0.76 (.57)  0.92 (.57)   0.29 (.57)
Extraversion   0.66 (.29)  0.45 (.29)  0.23 (.29)   0.49 (.57)  0.35 (.57)   0.89 (.57)
Neuroticism    0.30 (.29)  0.01 (.29)  2.23 (.29)   1.65 (.57)  1.92 (.57)  12.35 (.57)
Lie            0.80 (.30)  0.60 (.30)  0.17 (.30)   1.23 (.59)  0.64 (.59)   0.21 (.59)

Note. Standard errors of skewness and kurtosis are provided in parentheses.

appears that using the z transformation is not indicated and that neither the score reliability coefficients themselves nor the squared score reliability coefficients suffer from debilitating nonnormality. We also conducted parallel regression analyses for each operationalization of score reliability (results not shown) and found no differences in substantive interpretation. Untransformed score reliability coefficients were therefore used as criterion variables in the regression analyses presented next.

Our third and final goal was to examine the relationships between the predictor variables and score reliability. Table 4 shows the unstandardized and standardized regression weights and the structure coefficients of the predictors for P score reliability. Both sets of predictors made statistically significant contributions: the R2 for Block 1 was .34, F(2, 58) = 14.71, p < .0005, and the additional variance explained by the Block 2 predictors was .18, F(7, 51) = 2.66, p = .02. The adjusted R2 (a better estimate of the population R2) for the model with all predictors entered was .43. The standard deviation of scores was the strongest predictor of score reliability in both models. Based on the unstandardized regression coefficients (the Bs), the following interpretations can be made. As the standard deviation of scores increases by one, the reliability of scores increases by .05 (in Block 1) or .10 (when the Block 2 variables were entered). Sample type was also a statistically significant predictor, and, because this variable was coded as 0 (student) and 1 (nonstudent), we can also state that the reliability of scores from student samples was somewhat higher than that from nonstudent samples. Although statistically significant at an alpha level of .05, this effect was modest and is somewhat difficult to interpret due to the variety of sample types making up the nonstudent group.
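The block-entry procedure used throughout the Results (an R2 for Block 1, then the Block 2 increment tested with an F ratio) can be sketched with ordinary least squares. The data below are synthetic stand-ins, not the study's data, and the function names are invented:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit with an intercept column prepended to X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def increment_f(r2_full, r2_reduced, n, p_full, p_added):
    """F ratio testing the R^2 increment contributed by p_added predictors."""
    return ((r2_full - r2_reduced) / p_added) / ((1.0 - r2_full) / (n - p_full - 1))

rng = np.random.default_rng(0)
n = 69
block1 = rng.normal(size=(n, 2))   # stand-ins for items, SD of scores
block2 = rng.normal(size=(n, 7))   # stand-ins for the remaining predictors
y = 0.7 + 0.05 * block1[:, 1] + rng.normal(scale=0.05, size=n)  # reliability

r2_b1 = r_squared(block1, y)
r2_full = r_squared(np.column_stack([block1, block2]), y)
f_inc = increment_f(r2_full, r2_b1, n=n, p_full=9, p_added=7)
print(round(r2_b1, 2), round(r2_full - r2_b1, 2), round(f_inc, 2))
```

Because the models are nested, the full-model R2 can never fall below the Block 1 R2; the F ratio asks whether the increment is larger than chance alone would produce.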
The standard deviation of age was also a statistically significant predictor, with more age variability being associated with higher score reliability. This effect was also modest, but note that both sample type and age variability accounted for significant amounts of variance in score reliability over and above score variability. Table 5 provides the regression weights and structure coefficients for predicting E score reliability. Neither set of predictors made a statistically signif-



Table 4
Regression Analyses for Psychoticism Score Reliability

Predictor                         B     SEB    Beta     t        p       rs
Block 1
  Constant                      .416   .057           7.36    <.0005
  Number of items               .003   .003   .162    1.22     .229    .762
  Standard deviation of scores  .052   .015   .468    3.51     .001    .974
Block 2
  Constant                      .481   .098           4.88    <.0005
  Number of items               .001   .003   .025    0.16     .874    .616
  Standard deviation of scores  .098   .030   .874    3.30     .002    .788
  Mean of scores                .022   .015   .435    1.53     .133    .640
  Mean age                      .005   .002   .344    1.81     .077    .123
  Standard deviation of age     .015   .005   .578    3.06     .004    .227
  Sample type                   .070   .033   .276    2.15     .036    .160
  Proportion male               .040   .032   .127    1.25     .218    .042
  Language                      .028   .029   .098    0.98     .334    .014
  EPQ form                      .055   .030   .215    1.84     .072    .466

Table 5
Regression Analyses for Extraversion Score Reliability

Predictor                         B     SEB    Beta     t        p       rs
Block 1
  Constant                      .786   .037          21.33    <.0005
  Number of items               .002   .002   .123    0.82     .417    .137
  Standard deviation of scores  .015   .008   .298    1.99     .052    .914
Block 2
  Constant                      .857   .060          14.17    <.0005
  Number of items               .001   .009   .096    0.45     .651    .075
  Standard deviation of scores  .023   .009   .464    2.60     .012    .502
  Mean of scores                .006   .004   .389    1.59     .119    .267
  Mean age                      .002   .001   .386    1.54     .131    .063
  Standard deviation of age     .003   .003   .309    1.28     .208    .110
  Sample type                   .003   .017   .033    0.19     .849    .330
  Proportion male               .009   .016   .071    0.55     .585    .317
  Language                      .011   .016   .094    0.66     .510    .297
  EPQ form                      .001   .014   .006    0.05     .963    .276

icant contribution, F(2, 58) = 2.01, p = .143 for Block 1 and F(7, 51) = 1.39, p = .23 for Block 2. Despite the nonsignificance of the models, the standard deviation of the scores was significant when the Block 2 variables were entered. This effect was also small: With all other predictors in the model, as the standard deviation of scores increases by one, the score reliability increases by only .02. Only Block 1 predictors were statistically significant for N score reliability (see Table 6), with an adjusted R2 of .17: Block 1 F(2, 58) = 7.27, p = .002; Block 2 F(7, 51) = .81, p = .58. The standard deviation of scores was the only significant predictor of N score reliability.



Table 6
Regression Analyses for Neuroticism Score Reliability

Predictor                         B     SEB    Beta     t        p       rs
Block 1
  Constant                      .761   .027          27.91    <.0005
  Number of items               .001   .002   .080    0.53     .601    .525
  Standard deviation of scores  .019   .006   .494    3.25     .002    .989
Block 2
  Constant                      .698   .048          14.52    <.0005
  Number of items               .000   .002   .013    0.08     .940    .443
  Standard deviation of scores  .023   .009   .575    2.64     .011    .836
  Mean of scores                .000   .003   .031    0.15     .886    .479
  Mean age                      .002   .001   .345    1.47     .148    .243
  Standard deviation of age     .000   .002   .012    0.05     .958    .362
  Sample type                   .015   .013   .181    1.18     .245    .094
  Proportion male               .004   .014   .035    0.26     .794    .087
  Language                      .000   .012   .000    0.00     .998    .201
  EPQ form                      .003   .010   .030    0.25     .805    .004

Table 7
Regression Analyses for Lie Score Reliability

Predictor                         B     SEB    Beta     t        p       rs
Block 1
  Constant                      .613   .036          17.04    <.0005
  Number of items               .003   .002   .224    1.73     .089    .772
  Standard deviation of scores  .023   .007   .418    3.23     .002    .940
Block 2
  Constant                      .657   .060          11.00    <.0005
  Number of items               .000   .003   .027    0.16     .877    .686
  Standard deviation of scores  .026   .009   .475    2.96     .005    .835
  Mean of scores                .002   .003   .111    0.64     .524    .491
  Mean age                      .002   .001   .280    1.28     .208    .060
  Standard deviation of age     .005   .002   .464    2.22     .031    .345
  Sample type                   .027   .020   .254    1.34     .185    .146
  Proportion male               .004   .015   .033    0.28     .779    .094
  Language                      .003   .015   .024    0.20     .844    .298
  EPQ form                      .015   .012   .136    1.17     .249    .173

Table 7 gives the regression weights for predictors of score reliability on the L scale. Only Block 1 predictors explained a statistically significant amount of variance: Block 1 F(2, 55) = 12.93, p < .0005; Block 2 F(7, 48) = .99, p = .45. The adjusted R2 for Block 1 was .30. Again, the standard deviation of scores was the strongest predictor, but an increase of one results in an increase of only about .02 to .03 in score reliability. The standard deviation of age of the sample was also a significant predictor in Block 2: As the variability in the age of the sample increased by one, the reliability of the scores increased by .005.



Discussion
The main finding of the present study is that scores on the E, N, and L scales typically have adequate reliability, whereas scores on the P scale often have poor reliability. One of the reasons for the development of the EPQ-R was the poor reliability of scores on the P scale of the EPQ (S.B.G. Eysenck et al., 1985). Unfortunately, the form of the EPQ employed was not a statistically significant predictor of P score reliability; thus, it does not appear that the revision improved this property of scores from the P scale. Furthermore, both the mean and median reliability of scores on the P scale were less than .70, a value typically considered the minimum acceptable for personality questionnaires. It has been noted (H. J. Eysenck & Eysenck, 1994) that the P scale may be less unidimensional than the other scales of the EPQ; although this is consistent with the present findings, it does not reduce the difficulties encountered when score reliability is low. Future research attempting to delineate the factor structure of the P items may lead to the development of two or more subscales that are more internally consistent than the sum of all of the P items.

As noted, the distributions of either score reliability coefficients or squared reliability coefficients adequately approximate normality for most EPQ scales. Therefore, depending on one's reading of the issues raised by Sawilowsky (2000) and Thompson and Vacha-Haase (2000), either could be appropriately employed as criterion variables in parametric statistical analyses. The use of Fisher's z transformation, often employed when analyzing correlation coefficients, actually increased the deviation from normality. Why a transformation that has proved valuable in normalizing distributions of correlations (e.g., Dunlap et al., 1988; Silver & Dunlap, 1987) did not improve conditions here is unclear, but this is an area worthy of further study.
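The normality screening used in this study, sample skewness and excess kurtosis of the reliability coefficients, of their squares, and of their Fisher z transforms, is straightforward to reproduce in outline. A sketch with invented illustrative values (not the study's data):

```python
import math

def skewness(xs):
    """Sample skewness (third standardized moment)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3

def fisher_z(r):
    """Fisher's z transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)

rs = [0.36, 0.55, 0.66, 0.77, 0.91]  # illustrative reliability coefficients
for label, xs in [("r", rs),
                  ("r^2", [r * r for r in rs]),
                  ("z", [fisher_z(r) for r in rs])]:
    print(label, round(skewness(xs), 2), round(excess_kurtosis(xs), 2))
```

One design point worth noting: because arctanh stretches values near 1 much more than values near 0, the z transform can exaggerate the upper tail of a reliability distribution rather than symmetrize it, which is one plausible reading of the increased nonnormality observed here.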
It would be valuable for future RG studies to include an examination of nonnormality for the expressions of score reliability employed here and others. The results of this and other RGs as well as basic formulas of classical test theory indicate that observed score variability is a very important predictor of score reliability. In fact, few instances were found in this study in which other predictor variables accounted for variance in score reliability over and above that accounted for by score variability. However, other variables included in our analysis may have an effect on score reliability through increasing observed score variability, a hypothesis that seems quite likely for many of the predictors in our analyses. For example, the number of items administered correlated between .51 and .64 with the standard deviation of scores, making it difficult for the former variable to make an independent contribution to predicting score reliability. It may be the case that a thorough understanding of the variables that affect score reliability must wait for a more thorough understanding of those that affect score variability. Future RGs may



draw on the results of this and other analyses to develop path models or other testable models of score reliability. A disappointing finding from the present study is the small proportion of studies in which the reliability of scores was provided, as there are many important reasons for the inclusion of complete psychometric data in all empirical studies. Thompson and Vacha-Haase (2000) recommend that
large discrepancies between reliability estimates reported in the manual and those obtained in a given study alert the researcher to the possibility that the normative sample and the research sample may represent discrete populations, and thus such comparisons even may bear somewhat on the generalizability of substantive results. (p. 191)

Thus, the calculation (and presentation) of reliability coefficients in substantive research is not of interest solely to psychometricians. Such calculations can influence substantive interpretations and are an integral aspect of complete research reporting.

References
Studies presenting usable internal consistency coefficients are marked with an asterisk.

*Allsopp, J., Eysenck, H. J., & Eysenck, S.B.G. (1991). Machiavellianism as a component in psychoticism and extraversion. Personality and Individual Differences, 12, 29-41.
American Psychological Association. (1992). PsycINFO Psychological Abstracts Information Services users reference manual. Washington, DC: Author.
*Biswas, P. K. (1990). The Eysenck Personality Questionnaire (EPQ) on educated Mizos. Indian Journal of Clinical Psychology, 17, 71-73.
Block, J. (1977). The Eysencks and psychoticism. Journal of Abnormal Psychology, 86, 653-654.
Caruso, J. C. (2000). Reliability generalization of the NEO Personality Scales. Educational and Psychological Measurement, 60, 236-254.
Caruso, J. C., & Edwards, S. (in press). Reliability generalization of the Junior Eysenck Personality Questionnaire. Personality and Individual Differences.
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186-190.
*Corulla, W. J. (1987). A psychometric investigation of the Eysenck Personality Questionnaire (Revised) and its relationship to the I.7 Impulsiveness Questionnaire. Personality and Individual Differences, 8, 651-658.
*Corulla, W. J. (1988). A further psychometric investigation of the Sensation Seeking Scale Form-V and its relationship to the EPQ-R and the I.7 Impulsiveness Questionnaire. Personality and Individual Differences, 9, 277-287.
*Corulla, W. J. (1989). The relationships between the Strelau Temperament Inventory, Sensation Seeking and Eysenck's dimensional system of personality. Personality and Individual Differences, 10, 161-173.
*De Flores, T., & Valdes, M. (1986). Behaviour pattern A: Reward, fight or punishment? Personality and Individual Differences, 7, 319-326.
Dunlap, W. P., Silver, N. C., & Phelps, G. R. (1988). A Monte Carlo study of using the first eigenvalue for averaging intercorrelations. Educational and Psychological Measurement, 47, 917-923.


*Egan, V., Miller, E., & McLellan, I. (1998). Does the personal questionnaire provide a more sensitive measure of cardiac surgery-related anxiety than a standard pencil-and-paper checklist? Personality and Individual Differences, 24, 465-473.
Eysenck, H. J., & Eysenck, S.B.G. (1964). Manual of the Eysenck Personality Inventory. London: University of London Press.
Eysenck, H. J., & Eysenck, S.B.G. (1975). Manual of the Eysenck Personality Questionnaire. London: Hodder & Stoughton/EdITS.
Eysenck, H. J., & Eysenck, S.B.G. (1994). Manual of the Eysenck Personality Questionnaire: Comprising the EPQ-Revised (EPQ-R) and EPQ-R Short Scale. San Diego, CA: EdITS.
Eysenck, H. J., & Knapp, R. R. (1962). Manual for the Maudsley Personality Inventory. San Diego, CA: EdITS.
*Eysenck, S.B.G. (1981). National differences in personality: Sicily and England. Italian Journal of Psychology, 8, 87-93.
*Eysenck, S.B.G., & Allsopp, J. F. (1986). Personality differences between students and craftsmen. Personality and Individual Differences, 7, 439-441.
*Eysenck, S.B.G., Barrett, P., Spielberger, C., Evans, F. J., & Eysenck, H. J. (1986). Cross-cultural comparisons of personality dimensions: England and America. Personality and Individual Differences, 7, 209-214.
*Eysenck, S.B.G., & Chan, J. (1982). A comparative study of personality in adults and children: Hong Kong vs. England. Personality and Individual Differences, 3, 153-160.
*Eysenck, S.B.G., Eysenck, H. J., & Barrett, P. (1985). A revised version of the Psychoticism scale. Personality and Individual Differences, 6, 21-29.
*Eysenck, S.B.G., & Haapasalo, J. (1989). Cross-cultural comparisons of personality: Finland and England. Personality and Individual Differences, 10, 121-125.
*Eysenck, S.B.G., & Long, F. Y. (1986). A cross-cultural comparison of personality in adults and children: Singapore and England. Journal of Personality and Social Psychology, 50, 124-130.
*Eysenck, S.B.G., & Tambs, K. (1990). Cross-cultural comparison of personality: Norway and England. Scandinavian Journal of Psychology, 31, 191-197.
*Eysenck, S.B.G., & Yanai, O. (1985). A cross-cultural study of personality: Israel and England. Psychological Reports, 57, 111-116.
*Fontaine, K. R. (1994). Personality correlates of sexual risk-taking among men. Personality and Individual Differences, 17, 693-694.
*French, C. C., & Beaumont, J. G. (1989). A computerized form of the Eysenck Personality Questionnaire: A clinical study. Personality and Individual Differences, 10, 1027-1032.
Goh, D. S., King, D. W., & King, L. A. (1982). Psychometric evaluation of the Eysenck Personality Questionnaire. Educational and Psychological Measurement, 42, 297-309.
*Gomà-i-Freixanet, M. (1997). Consensual validity of the EPQ: Self-reports and spouse-reports. European Journal of Psychological Assessment, 13, 179-185.
*Heaven, P.C.L. (1989). Orientation to authority and its relation to impulsiveness. Current Psychology: Research and Reviews, 8, 38-45.
*Heaven, P.C.L., Connors, J., & Trevathan, R. (1987). Authoritarianism and the EPQ. Personality and Individual Differences, 8, 677-680.
Helmes, E. (1980). A psychometric investigation of the Eysenck Personality Questionnaire. Applied Psychological Measurement, 4, 43-55.
*Hosokawa, T., & Ohyama, M. (1993). Reliability and validity of a Japanese version of the short-form Eysenck Personality Questionnaire-Revised. Psychological Reports, 72, 823-832.
*Jahanshahi, M. (1990). Personality in torticollis: Changes across time. Personality and Individual Differences, 11, 355-363.
*Kardum, I., & Hudek-Knezevic, J. (1996). The relationship between Eysenck's personality traits, coping styles and moods. Personality and Individual Differences, 20, 341-350.


*Levin, J., & Montag, I. (1987). The effect of testing instructions for handling social desirability on the Eysenck Personality Questionnaire. Personality and Individual Differences, 8, 163-167.
*Lewis, C. A., & Maltby, J. (1996). Personality, prayer, and church attendance in a sample of male college students in the USA. Psychological Reports, 78, 976-978.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
*McCown, W., Keiser, R., Mulhearn, S., & Williamson, D. (1997). The role of personality and gender in preference for exaggerated bass in music. Personality and Individual Differences, 23, 543-547.
*Merten, T., & Ruch, W. (1996). A comparison of computerized and conventional administration of the German versions of the Eysenck Personality Questionnaire and the Carroll Rating Scale for depression. Personality and Individual Differences, 20, 281-291.
*Merten, T., & Siebert, K. (1997). A comparison of computerized and conventional administration of the EPQ-R and CRS: Further data on the Merten and Ruch (1996) study. Personality and Individual Differences, 22, 283-286.
*Mohan, J., & Virdi, P. K. (1985). Standardisation of P. Q. on Punjab University students. Indian Psychological Review, 28, 20-28.
*Mortensen, E. L., Reinisch, J. M., & Sanders, S. A. (1996). Psychometric properties of the Danish 16PF and EPQ. Scandinavian Journal of Psychology, 37, 221-225.
*Muntaner, C., Garcia-Sevilla, L., Fernandez, A., & Torrubia, R. (1988). Personality dimensions, schizotypal and borderline personality traits and psychosis proneness. Personality and Individual Differences, 9, 257-268.
*Nagoshi, C. T., Pitts, S. C., & Nakata, T. (1993). Intercorrelations of attitudes, personality, and sex role orientation in a college sample. Personality and Individual Differences, 14, 603-604.
Nunnally, J. C. (1970). Introduction to psychological measurement. New York: McGraw-Hill.
*Parker, J.D.A., Bagby, R. M., & Taylor, G. J. (1989). Toronto Alexithymia Scale, EPQ and self-report measures of somatic complaints. Personality and Individual Differences, 10, 599-604.
*Perera, M., & Eysenck, S.B.G. (1984). A cross-cultural study of personality: Sri Lanka and England. Journal of Cross-Cultural Psychology, 15, 353-371.
*San Martini, P., & Mazzotti, E. (1990). Relationships between the factorial dimensions of the Strelau Temperament Inventory and the EPQ-R. Personality and Individual Differences, 11, 909-914.
*San Martini, P., Mazzotti, E., & Setaro, S. (1996). Factor structure and psychometric features of the Italian version of the EPQ-R. Personality and Individual Differences, 21, 877-882.
Sawilowsky, S. S. (2000). Psychometrics versus datametrics: Comment on Vacha-Haase's reliability generalization method and some EPM editorial policies. Educational and Psychological Measurement, 60, 157-173.
Silver, N. C., & Dunlap, W. P. (1987). Averaging correlation coefficients: Should Fisher's z transformation be used? Journal of Applied Psychology, 72, 146-148.
*Tambs, K., Sundet, J. M., Eaves, L., & Berg, K. (1989). Relations between EPQ and Jenkins Activity Survey. Personality and Individual Differences, 10, 1229-1235.
*Tarrier, N., Eysenck, S.B.G., & Eysenck, H. J. (1980). National differences in personality: Brazil and England. Personality and Individual Differences, 1, 164-171.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837-847.
Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174-195.
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-20.


Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement, 60, 509-522.
Vacha-Haase, T., Ness, C., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335-341.
*Vanderzee, K., Buunk, B., & Sanderman, R. (1996). The relationship between social comparison processes and personality. Personality and Individual Differences, 20, 551-565.
Viswesvaran, C., & Ones, D. S. (2000). Measurement error in Big Five Factors personality assessment: Reliability generalization across studies and measures. Educational and Psychological Measurement, 60, 224-235.
*Weyers, P., Krebs, H., & Janke, W. (1995). Reliability and construct validity of the German version of Cloninger's Tridimensional Personality Questionnaire. Personality and Individual Differences, 19, 853-861.
Whittington, D. (1998). How well do researchers report their measures? An evaluation of measurement in published educational research. Educational and Psychological Measurement, 58, 21-37.
Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
*Wilson, D. J., & Doolabh, A. (1990). A cross-cultural examination of Howarth's primary factors and Eysenck's secondary factors among Zimbabwean adolescents. Personality and Individual Differences, 11, 657-662.
*Wilson, D. J., & Doolabh, A. (1992). Reliability, factorial validity and equivalence of several forms of the Eysenck Personality Inventory/Questionnaire in Zimbabwe. Personality and Individual Differences, 13, 637-643.
*Wilson, D., & Mutero, C. (1989). Personality concomitants of teacher stress in Zimbabwe. Personality and Individual Differences, 10, 1195-1198.
Yin, P., & Fan, X. (2000). Assessing the reliability of the Beck Depression Inventory scores: Reliability across studies. Educational and Psychological Measurement, 60, 201-223.
