
Journal of Business and Psychology, Vol. 19, No. 3, Spring 2005. DOI: 10.1007/s10869-004-2235-x

MEASUREMENT EQUIVALENCE AND MULTISOURCE RATINGS FOR NON-MANAGERIAL POSITIONS: RECOMMENDATIONS FOR RESEARCH AND PRACTICE

James M. Diefendorff
Louisiana State University

Stanley B. Silverman
University of Akron

Gary J. Greguras
Singapore Management University

ABSTRACT: The present investigation applies a comprehensive sequence of confirmatory factor analysis tests (Vandenberg and Lance, Organizational Research Methods, 3, 4-69, 2000) to the examination of the measurement equivalence of self, peer, and supervisor ratings of non-managerial targets across several performance dimensions. Results indicate a high degree of measurement equivalence across rater sources and performance dimensions. The paper illustrates how this procedure can identify very specific areas of non-equivalence and how the complexity of a multisource feedback system may be represented using such procedures. Implications of these results and recommendations for both research and practice are offered.

KEY WORDS: 360 degree feedback; measurement equivalence; non-managerial ratings

Organizations oftentimes implement multisource feedback systems as a tool to develop their employees (Church & Bracken, 1997; Yammarino & Atwater, 1997). In fact, multisource feedback systems are so popular that almost all Fortune 500 companies use this approach to assess managerial performance (Cheung, 1999). Feedback from multiple sources is assumed to provide a clearer and more comprehensive picture of a person's strengths and weaknesses than feedback from only one's supervisor (London & Smither, 1995).
This comprehensive feedback is assumed to improve recommendations for individual development and to enhance one's self-awareness and job performance (Tornow, 1993). Ratees in multisource feedback systems typically are managers who are evaluated by some combination of potential rater sources (e.g., supervisors, subordinates, and peers). Recognizing the potential advantages of collecting performance information from multiple rater sources, organizations more recently have begun using such systems to evaluate non-managerial employees.

Despite the popularity of multisource feedback systems, many of the assumptions underlying the advantages and practical uses of these systems remain untested (Church & Bracken, 1997). One often overlooked assumption is that the measurement instrument means the same thing to all raters and functions the same across rater sources (Cheung, 1999). However, if the rating instrument is not equivalent across rater sources, substantive interpretations of the ratings and practical recommendations may be inaccurate or misleading. That is, violating the (often untested) assumption of measurement equivalence can render comparisons across sources meaningless and likely compromises the utility of using the ratings to create personal development plans or to make administrative decisions. As Cheung (1999) noted, multisource feedback systems typically have differences among raters, and correctly identifying and understanding these differences is critical for the effective use of the performance information.

The current paper has two main objectives. First, a comprehensive data analytic technique for assessing the equivalence of multisource ratings, as suggested by Vandenberg and Lance (2000), is discussed and illustrated. Whereas the vast majority of existing research on measurement equivalence has assessed only conceptual equivalence, the current study describes a series of tests assessing both conceptual and psychometric equivalence. The establishment of measurement equivalence of an instrument across different groups is a prerequisite to making meaningful comparisons between groups (Drasgow & Kanfer, 1985; Reise, Widaman, & Pugh, 1993). In the absence of measurement equivalence, substantive interpretations may be incorrect or misleading (Maurer, Raju, & Collins, 1998; Vandenberg & Lance, 2000). As such, the establishment of the measurement equivalence of instruments has important implications for both theory development and practical interventions.

The second purpose of this paper is to apply this data analytic technique to assess the measurement equivalence of multisource feedback ratings across rater sources and performance dimensions for non-managerial employees. Although several studies have partially assessed the measurement equivalence of multisource ratings, the ratees in these studies have primarily been managers.

Assessing the measurement equivalence of ratings of non-managerial targets across rater sources represents the next logical step in this line of research and is important because organizations are increasingly using multisource ratings to evaluate and develop non-managerial employees.

Establishing the measurement equivalence of multisource ratings is important because multisource feedback reports often are designed to compare and contrast ratings across rater sources (Dalessio & Vasilopoulos, 2001; London & Smither, 1995). Discrepancies between rater sources are often highlighted in feedback meetings, and the target is encouraged to reflect upon these rater source differences. However, if the measurement equivalence of the multisource instrument across rater sources has not been established, the observed differences may reflect non-equivalence between sources rather than true differences. Such misinterpretations could misguide individual development plans and organizational decisions.

MEASUREMENT EQUIVALENCE AND MULTISOURCE RATINGS

In discussing the measurement equivalence of performance ratings, Cheung (1999) identified two main categories: conceptual equivalence and psychometric equivalence. Conceptual equivalence indicates that the items on the feedback instrument have the same factor structure across different rater sources. That is, across sources there is the same number of underlying dimensions, the specific behaviors (represented as items) load on the same dimensions, and the item loadings are of roughly the same magnitude. Thus, the instrument and the underlying performance constructs conceptually mean the same thing across rater sources. Psychometric equivalence indicates that the instrument not only has the same factor structure, but is responded to in the same manner by the different rater sources. That is, across rater sources the items and scales exhibit the same degree of reliability, variance, range of ratings, mean level of ratings, and intercorrelations among dimensions. A lack of psychometric equivalence may indicate one of several rating biases (e.g., halo, severity), depending on where the non-equivalence lies. Cheung demonstrated how confirmatory factor analysis (CFA) procedures can be used to identify various types of conceptual and psychometric equivalence between self and manager ratings. In addition, he showed how CFA is superior to other methods in identifying non-equivalence.

Several studies have used CFA procedures to assess the conceptual equivalence of multisource ratings across sources. For example, Lance and Bennett (1997) analyzed self, supervisor, and peer ratings of eight samples of US Air Force airmen on one general performance factor (labeled Interpersonal Proficiency). In 5 of the 8 samples, results indicated that the different rater sources held different conceptualizations of performance (i.e., conceptual non-equivalence).

As such, these authors suggested that many of the observed differences between rater sources (e.g., differences in mean rating level) may be the result of rater source differences in their conceptualizations of performance. Interestingly, these results are in contrast to a study by Woehr, Sheehan, and Bennett (1999), who also analyzed Air Force airmen ratings from the same large US Air Force Job Performance Measurement (JPM) project. Rather than investigating the overall performance dimension (i.e., Interpersonal Proficiency) as suggested and supported by Lance and Bennett (1997), Woehr et al. (1999) analyzed self, supervisor, and peer ratings for the eight performance dimensions separately. With this sample, Woehr et al. (1999) found that the different rater sources were relatively conceptually equivalent.

Aside from the conflicting results of the two military studies described above, the remaining research on the measurement equivalence of multisource ratings generally has found that ratings across rater sources and performance dimensions are conceptually equivalent. For example, Cheung (1999) investigated self and supervisor ratings of 332 mid-level executives on two broad performance dimensions (labeled internal and external roles). Results indicated that the mid-level executives and their managers were conceptually equivalent. Likewise, Maurer et al. (1998) found that peer and subordinate ratings of managers on a team-building dimension were conceptually equivalent (the Maurer et al. study did not test for psychometric invariance). Finally, in the most comprehensive study of the measurement equivalence of multisource ratings to date, Facteau and Craig (2001) analyzed self, supervisor, subordinate, and peer ratings of a managerial sample on seven performance dimensions. Results indicated that ratings from these various rater sources across the seven performance dimensions were conceptually invariant (with the exception of one error covariance in the self and subordinate groups).

Taken together, the existing literature on the measurement equivalence of multisource ratings (with the exception of the study by Lance & Bennett, 1997) has indicated that ratings from different sources across various performance dimensions are conceptually equivalent. The current study differs from these existing studies in several important ways. The current study applies a more comprehensive set of nested models to test for both conceptual and psychometric equivalence than do previous studies. For example, the Maurer et al. (1998) and Facteau and Craig (2001) studies assessed only conceptual equivalence. The current study also differs from previous studies by conducting several tests for partial invariance and by illustrating how various potential sources of non-equivalence may be identified. This illustration and description can help researchers and practitioners pinpoint the source of non-equivalence, if observed.

Finally, the sample used in the current study is a salaried, non-managerial sample, whereas previous studies have investigated managerial or military samples. Although multisource feedback systems were originally, and are typically, designed to evaluate managerial performance, many organizations are beginning to implement such systems for non-managerial positions. As such, this is the first study to investigate the equivalence of multisource ratings in a civilian, non-managerial sample. This paper first discusses a sequence of eight nested models that could be used to test for all types of equivalence between groups. Following this discussion, the importance of assessing the measurement equivalence of supervisor, peer, and self-ratings for the non-managerial sample of the current study is highlighted.

MODELS FOR TESTING MEASUREMENT EQUIVALENCE

Vandenberg and Lance (2000) recently reviewed and integrated the literature on measurement equivalence in organizational research, calling for increased application of measurement equivalence techniques before substantive hypotheses are tested. In their review, they stated that violations of measurement equivalence assumptions are "as threatening to substantive interpretations as is an inability to demonstrate reliability and validity" (p. 6). That is, non-equivalence between groups indicates that the measure is not functioning the same across groups, and any substantive interpretation of differences (e.g., supervisor ratings of a dimension being higher than peer ratings) may be suspect. As Vandenberg and Lance (2000) note, because classical test theory (Crocker & Algina, 1986) cannot adequately identify measurement equivalence across populations, the recommended approach is to use CFA procedures in a series of hierarchically nested models. The primary advantage of CFA procedures over traditional approaches is that they account for measurement error (Bollen, 1989) in estimating group differences. The advantage of Vandenberg and Lance's (2000) approach over other approaches is that it represents the most comprehensive series of measurement equivalence tests in the literature, and thus is best equipped to identify differences between groups.

Applying Vandenberg and Lance's (2000) approach, Table 1 presents the sequence of models proposed to test the measurement equivalence of a multisource feedback instrument, the specific nature of the constraint applied at each step, and the implications of rejecting the null hypothesis of no differences between rater sources. We adopt the terminology suggested by Vandenberg and Lance (2000) in identifying the various models and the constraints imposed. The initial test in the sequence is that of the Equality of Covariance Matrices (Model 0), which assesses overall measurement equivalence between sources.

Table 1
The Sequence of Measurement Invariance Tests Used in the Present Investigation

Model 0. Equality of Covariance Matrices (Σg = Σg')
  Constraint: Everything is constrained to be equal across groups.
  Implication of rejecting H0: Indicates some form of non-equivalence between rater sources.

Model 1. Configural Invariance
  Constraint: Tests for an equivalent factor structure across rater groups.
  Implication of rejecting H0: There is disagreement in the number or composition of factors across rater sources.

Model 2. Metric Invariance (Λxg = Λxg')
  Constraint: Like items' factor loadings are constrained to be equal across sources.
  Implication of rejecting H0: There is disagreement over the pattern of factor loadings across rater sources; sources do not agree regarding the relative importance of behavioral indicators in defining the dimension.

Model 3. Scalar Invariance (τxg = τxg')
  Constraint: Like items' intercepts are constrained to be equal across rater groups.
  Implication of rejecting H0: The item indicators have different mean levels across groups, suggesting the possibility of a response bias, such as leniency/severity, for a rater source.

Model 4. Invariant Uniquenesses (Θδg = Θδg')
  Constraint: Uniquenesses are constrained to be equal across rater groups.
  Implication of rejecting H0: The rating instrument exhibits different levels of reliability for different rater sources (e.g., there are differences in the amount of error variance for each item across rater sources).

Model 5. Invariant Factor Variances (Φjjg = Φjjg')
  Constraint: Factor variances are constrained to be equal across rater groups.
  Implication of rejecting H0: The ratings from different sources do not use an equivalent range of the scale; thus, non-equivalence could indicate range restriction for a rater source.

Model 6. Invariant Factor Covariances (Φjj'g = Φjj'g')
  Constraint: Factor covariances across dimensions are constrained to be equal across rater groups.
  Implication of rejecting H0: The relationships between constructs within rater sources differ across rater groups. This can reflect a halo error for sources with strong correlations between factors.

Model 7. Invariant Factor Means (κg = κg')
  Constraint: Factor means are constrained to be equal across rater groups.
  Implication of rejecting H0: Different sources are rating the focal employee at different levels on the latent construct, suggesting the possibility of a leniency/severity bias.
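The constraints in Table 1 can also be stated compactly in conventional CFA notation; the following is a sketch using standard symbols rather than notation reproduced from the original article. For item i rated by source g,

    x_i^{(g)} = \tau_i^{(g)} + \lambda_i^{(g)} \xi^{(g)} + \delta_i^{(g)}, \qquad \mathrm{Var}\big(\delta_i^{(g)}\big) = \theta_i^{(g)}.

Metric invariance then imposes λi(g) = λi for all g, scalar invariance adds τi(g) = τi, invariant uniquenesses add θi(g) = θi, and the remaining models constrain the variances, covariances, and means of the latent factors ξ(g) across sources.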

In this model, everything is constrained to be equal across rater sources, so failure to reject the null hypothesis indicates complete measurement equivalence across the groups and no subsequent tests are required. This is the most restrictive test and is done first to identify whether any between-group differences exist. If the null hypothesis is rejected, subsequent tests are needed to identify the source(s) of the non-equivalence. Although Cheung (1999) did not perform this analysis, Vandenberg and Lance (2000) recommend it as a useful initial test.

The second test, referred to as Model 1 because it is the first test in a series of nested models in the invariance hierarchy, is that of Configural Invariance. Configural invariance is the least restrictive model, positing only an equivalent factor structure across rater sources. If the null hypothesis that there are no between-group differences is rejected, it is interpreted to mean that the rater sources disagree over the number or composition of factors contained in the instrument. This finding could be due to several issues, including rater sources possessing different understandings of performance, having access to different types of performance information (Campbell & Lee, 1988), or outright disagreeing concerning the ratee's job duties. This test is the first step in demonstrating the conceptual equivalence of the scale across rater sources. If the null hypothesis is rejected, further tests are not conducted because the underlying constructs are defined differently across rater sources.

Model 2 provides a test of Metric Invariance, where the factor loadings are constrained to be equal across rater groups. Cheung (1999) identified this as the second test for conceptual equivalence. At issue here is whether the strength of the relationship between specific behaviors (items) and the performance dimensions is the same for different rater sources. For example, configural invariance (Model 1) may be present such that two rater sources agree that a behavior (e.g., listens to others) is related to a performance dimension (e.g., interpersonal skills), but metric invariance (Model 2) may be absent because one source (peers) considers the behavior more important for defining the dimension than another source (managers). Rejecting the null hypothesis at this step indicates that the item indicators load on the factors differently for different rater sources.

If the null hypothesis is not rejected for Model 1 or Model 2, then there is complete conceptual equivalence for the instrument. Some researchers suggest that having conceptual equivalence is the primary requirement for using the instrument in different groups (Cheung & Rensvold, 1998; Reise et al., 1993; Ryan, Chan, Ployhart, & Slade, 1999). That is, although other sources of non-equivalence may be revealed in subsequent tests, having conceptual equivalence indicates that the measure can be compared and used across groups. The remaining models are categorized as tests of psychometric equivalence and reflect between-group differences that are often related to substantive hypotheses in multisource feedback research (e.g., mean differences, differences in range restriction). A finding of non-equivalence for one of these remaining tests does not necessarily mean the instrument is inappropriate for use across sources.

Model 3 is the test of Scalar Invariance, where like items' intercepts for different sources are constrained to be equal. In a practical sense, this test examines whether the means of the item indicators are equivalent across rater groups (Bollen, 1989). Differences in intercepts may mean that, although the sources agree conceptually on the dimensions of the instrument, one source consistently rates the focal employees lower (severity bias) or higher (leniency bias) than other sources. The presence of mean differences between sources has been found in the literature, with supervisors rating more leniently than peers and focal employees rating more leniently than both peers and supervisors (Harris & Schaubroeck, 1988). Rejecting the null hypothesis at this step indicates that there may be a response bias (leniency/severity) at the item level for one or more rater sources (Taris, Bok, & Meijer, 1998). Although Cheung did not use this test, it could provide diagnostic evidence regarding the prevalence and consistency of bias in the instrument at the item level of analysis.

Model 4 is the test of Invariant Uniquenesses, where the item indicator uniquenesses are constrained to be equal across raters. The unique variance for each item is considered to reflect measurement error, so constraining these to be equal essentially tests whether the instrument has equivalent levels of reliability across sources. That is, because scale reliability decreases as random measurement error increases, this constraint assesses whether the scale has the same degree of reliability across sources. Rejecting the null hypothesis demonstrates that there are differences in the degree of error present in the measure across rater sources. Possible reasons for differences in measurement error between sources include unequal opportunities to observe performance (Rothstein, 1990), unfamiliarity with scale items, or inexperience with the rating format (Cheung, 1999).

Model 5 is the test of Invariant Factor Variances, where the variances of the latent constructs are constrained to be equal across rating sources. Factor variances represent the degree of variability, or range, of the latent construct used by the rater sources. The question being addressed with this test is whether different rater sources use more or less of the possible range of the performance construct than other sources. Rejecting the null hypothesis suggests that one or more sources have a relatively restricted range in their ratings.

Model 6 is the test of Invariant Factor Covariances, where the relationships among the latent constructs within a rater source are constrained to be equal across rater groups. Rejecting the null hypothesis indicates that the rater sources differ in the strength of the relationships among the latent factors, indicating a possible halo effect for one or more sources. That is, the relationships between the latent factors are different across rater sources, with some sources more strongly discriminating among the performance dimensions than others. There is some evidence that supervisor ratings exhibit greater halo than self-ratings (Holzbach, 1978), suggesting that individuals are better able to make distinctions among dimensions for their own performance than are observers.

Model 7 is the test of Invariant Factor Means, where the means of the latent constructs are constrained to be equal across groups. Rejection of the null hypothesis indicates that the rater sources are rating the focal employee at different levels on the latent construct. Similar to the test of Scalar Invariance, this test is a way to evaluate the presence of leniency/severity for a particular rater source, but at the construct level rather than the item level. Differences in mean ratings may be a more accurate indicator of any leniency or severity bias because idiosyncratic item effects between sources will likely cancel each other out.

In addition to these specific constraints applied between rater sources (Models 0-7), the size of the correlations between sources on a dimension (i.e., the level of agreement) can be investigated from the analyses of the covariance structure in the CFA procedures (Cheung, 1999). Consistent with other conceptualizations, low correlations indicate high disagreement (Harris & Schaubroeck, 1988). Again, the primary advantage of using the CFA results to estimate the correlations between sources on a dimension is that CFA controls for measurement error. Taken together, the series of hierarchical tests described above provides a comprehensive framework for examining the equivalence of ratings from multiple rater sources across multiple performance dimensions.

Multisource Ratings of Non-managerial Jobs

As organizations continue their shift toward fewer levels of management (Whitaker, 1992), more emphasis will be placed on individual accountability, performance, and development (Murphy & Cleveland, 1995). As a way to measure and develop individual performance, organizations increasingly are using multisource ratings for both managerial and non-managerial employees. Although previous research generally has observed that ratings of managers are conceptually equivalent across rater sources, research has not investigated whether the same is observed for non-managers. As discussed below, because of the differences between managerial and non-managerial positions, there are several reasons why results from managerial samples may not generalize to non-managerial samples.

First, employees at different organizational levels (e.g., managers and non-managers) often have different experiences with, and perceptions of, the organization in general, and of evaluation and feedback systems in particular (Ilgen & Feldman, 1983; Mount, 1983; Williams & Levy, 2000). Second, managers within an organization are more likely than non-managers to have participated in the development and implementation of new interventions (e.g., multisource feedback systems) and, therefore, are more likely to have a better understanding of their processes and procedures (Pooyan & Eberhardt, 1989; Williams & Levy, 2000). Third, managers are more likely to have received training with respect to appraisal systems than are non-managers (Williams & Levy, 2000). Finally, the nature of work between managers and non-managers often is quite different. That is, managerial work tends to be harder to observe and more discontinuous than non-managerial work (Borman & Brush, 1993), potentially making it more difficult to evaluate.

Consistent with these expectations, several empirical studies have found differences between managers' and non-managers' perceptions and use of appraisal systems. For example, Conway and Huffcutt (1997) found that supervisor and peer ratings were more reliable and contained more true score variance for non-managerial than managerial jobs. Further, Conway and Huffcutt (1997) observed that correlations between supervisor, peer, and self-ratings were moderated by job type (i.e., managerial versus non-managerial) such that correlations between sources were higher for non-managerial jobs than for managerial jobs. In another study, Williams and Levy (2000) found that managers were more satisfied with their appraisals, perceived the procedures to be fairer, and had higher levels of perceived system knowledge than did non-managers. Additionally, Mount (1983) found that non-supervisory employees responded to performance appraisal systems in a more global way than did supervisors. Given that the requirements of managerial and non-managerial jobs differ, Williams and Levy (2000) called for research investigating the effects of these differences on important individual and organizational processes and outcomes. The current study begins to respond to that call by examining the measurement equivalence of multisource ratings for non-managerial jobs.

METHOD

Participants

The ratees in this study were non-managerial, professional employees (e.g., accountants, engineers, programmers, research scientists) from a Fortune 150 firm. In addition to rating themselves, participants were rated by their peers and managers.

Only participants with ratings from all three sources were included in the present investigation. Furthermore, in instances where an employee had more than one rating from a source (i.e., more than one peer or supervisor rating), a single rating was randomly selected for inclusion in the study. The final sample consisted of 2158 individuals, each with ratings from all three sources, for a total of 6474 ratings. No demographic information was collected on the participants.

Measures and Procedures

Individuals participated in a multisource feedback program for developmental purposes. The feedback program consisted of four tools: (1) the feedback instrument, which was administered to the focal employee, peers, and supervisors; (2) an individual feedback report given to the ratee providing details on how he/she was rated; (3) a self-managed workbook for interpreting the feedback report, developing an action plan, and setting goals for changing behaviors; and (4) a development guide providing specific recommendations for improving performance in particular areas (including behavioral strategies and a list of developmental resources). The original feedback instrument consisted of 53 items assessing a variety of behaviors identified as necessary for successful performance. Participants rated the extent to which an individual displayed each behavior on a scale from 1 (not at all) to 5 (a great deal).

RESULTS

Preliminary Analyses

Rating forms with more than 15% of the responses missing (i.e., missing seven items or more) were excluded from further analyses. If less than 15% of the values were missing, values were imputed using the expectation maximization (EM) estimation procedure (Dempster, Laird, & Rubin, 1977). Across sources, this resulted in the estimation and replacement of less than 1.28% of the responses on average (.80% for self ratings, 1.18% for supervisor ratings, and 1.84% for peer ratings).

To initially identify the factor structure underlying the instrument, managerial ratings from a separate sample (N = 1,012) of non-managerial employees from the same organization were submitted to principal axis exploratory factor analysis (EFA) with varimax rotation. This sample of employees did not differ in any discernible way from the primary sample. Managerial ratings were used for this purpose because managers tend to be the most familiar with rating and evaluating others' performance, and these ratings may provide a good frame of reference for identifying the underlying structure of performance. In identifying the factors, items that had high loadings (.40 or higher) on a factor and low cross-loadings (.32 or lower) on the other factors were retained, as illustrated in the sketch below.
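A minimal sketch of this item-retention rule, assuming a hypothetical rotated loading matrix rather than the EFA output actually obtained in this study:

    import numpy as np

    # Hypothetical varimax-rotated loading matrix: rows = items, columns = factors.
    # Values are illustrative only; they are not the loadings reported in this study.
    loadings = np.array([
        [0.72, 0.18, 0.10],
        [0.45, 0.35, 0.12],   # cross-loading of .35 > .32 -> dropped
        [0.38, 0.20, 0.15],   # primary loading < .40 -> dropped
        [0.12, 0.66, 0.22],
    ])

    def retain_items(loadings, primary_min=0.40, cross_max=0.32):
        """Keep items whose largest loading is >= .40 and whose other loadings are all <= .32."""
        keep = []
        for i, row in enumerate(np.abs(loadings)):
            primary = np.argmax(row)
            others = np.delete(row, primary)
            if row[primary] >= primary_min and np.all(others <= cross_max):
                keep.append(i)
        return keep

    print(retain_items(loadings))  # -> [0, 3]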

The EFA resulted in the extraction of three primary factors, labeled Relationships (12 items, α = .94), Customer Focus (8 items, α = .92), and Continual Learning (8 items, α = .91). These dimensions were then confirmed using the focal sample of this study (see below).

Overview of Analyses

The present investigation used LISREL 8.3 (Joreskog & Sorbom, 1993) to test all of the CFA models. For all analyses, a loading was set to 1.0 for one item chosen arbitrarily from each factor in order to scale the latent variables (Bollen, 1989). As a preliminary step, we examined whether the proposed 3-factor structure fit the observed covariance matrices separately for each source. Data from all three sources showed good fit with the hypothesized factor structure. Next, the ratings were combined into one data set to test the various levels of measurement equivalence. The data were structured in a repeated measures fashion, such that ratings from different sources on the same individual were identified as being the same case. This was done because the ratings from the three sources were not independent (they were all evaluations of the same employee), rendering a true multiple-group CFA procedure inappropriate (Cheung, 1999).

In evaluating the adequacy of a given model, the present investigation utilized the χ² goodness-of-fit statistic and the following fit indices: (a) the Tucker-Lewis Index (TLI), which is identified in LISREL as the non-normed fit index (NNFI; Tucker & Lewis, 1973); (b) the root mean square error of approximation (RMSEA; Steiger, 1990); (c) the standardized root mean square residual (SRMR; Bentler, 1995); and (d) the Comparative Fit Index (CFI). The lower bound of good fit for the TLI and the CFI is considered to be .90, whereas for the RMSEA and the SRMR the upper bounds for good fit are considered to be .08 and .10, respectively (Vandenberg & Lance, 2000). Although Hu and Bentler (1999) recommended more stringent standards for evaluating model fit (CFI ≥ .95, TLI ≥ .95, RMSEA ≤ .06, SRMR ≤ .08), Vandenberg and Lance (2000) suggested that this recommendation may be premature and that more research is needed before these new standards are adopted. They suggested that the more commonly accepted standards presented above be considered as the lower bound for good fit, and that reaching the standards suggested by Hu and Bentler (1999) would indicate very good fit.
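For readers unfamiliar with these indices, the sketch below shows how RMSEA, CFI, and TLI can be computed from the χ² values and degrees of freedom of a fitted model and its independence (baseline) model; the numbers plugged in are hypothetical and are not output from the models reported here.

    from math import sqrt

    def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
        """Approximate RMSEA, CFI, and TLI from model (m) and baseline (b) chi-square values.

        chi2_m, df_m : chi-square and df of the substantive model
        chi2_b, df_b : chi-square and df of the independence (baseline) model
        n            : sample size
        """
        rmsea = sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))
        cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 1e-12)
        tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)
        return rmsea, cfi, tli

    # Hypothetical values for illustration only.
    print(fit_indices(chi2_m=1200.0, df_m=600, chi2_b=15000.0, df_b=660, n=2158))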

To analyze the various types of equivalence, the series of hierarchical modeling procedures identified by Vandenberg and Lance (2000) and discussed above was used. For each comparison, we constrained the relevant parameters to be equal across rating sources and examined whether a significant reduction in model fit occurred from the less constrained to the more constrained model. Although the χ² difference test is the most common method of examining the difference between nested models, the present investigation also looked at changes in other fit indices to make a more informed decision regarding model fit. The rationale for doing so is that with such a large sample, traditional χ² tests and even χ² difference tests may be overly sensitive, yielding significant results even with small model misfit (Cheung & Rensvold, 1999).

In addition to comparing models where constraints were applied across all scales and rater sources, partial invariance also was examined by placing constraints on specific parameters in an attempt to identify the sources of non-equivalence. That is, if non-equivalence was detected, a search for its cause was conducted by selectively freeing constraints on parameters and examining whether the Δχ² values and other fit statistics improved. One way partial invariance was examined was to test the series of models identified by Vandenberg and Lance (2000) for each dimension separately. Examining the dimensions separately can help to demonstrate whether the sources are more equivalent on some dimensions than others. A second way partial invariance was examined was by freeing one source at a time and constraining the remaining two sources to be equal in a round-robin fashion. This procedure can determine whether one rater source is primarily responsible for the misfit present in the data (i.e., two sources are similar and one is dissimilar) or whether the misfit is fairly uniform across sources (i.e., all three sources are different from each other). To illustrate the nature of the non-equivalences, descriptive data (e.g., factor loadings, latent means, reliabilities) also are reported.

Tests of Measurement Equivalence

Table 2 presents the fit statistics for Models 0-7 (described above) and the change in χ² and change in df between nested models. For each model, the fit indices (with the exception of the χ² goodness-of-fit statistic, which is overly sensitive with large samples) were above the minimum fit recommendations. Furthermore, although there are significant changes in χ² for each comparison of nested models, the changes in other fit statistics were quite small, suggesting that the models may not be much worse and that the significant changes in χ² may be due to the large sample used in this investigation (Cheung & Rensvold, 1999). For illustrative purposes, the model comparisons are pursued with the goal of demonstrating how various measurement differences can be identified. In addition, descriptive data are presented at the item level in Tables 3 (factor loadings and item means) and 4 (variances of the measurement errors and scale reliabilities), and at the latent construct level in Table 5 (factor means, variances, covariances, and intercorrelations).

Table 2
Results for the Sequence of Measurement Invariance Tests

Model 0. Invariant Covariance Matrices: df = 2434, χ² = 4583.49, RMSEA = .020, TLI = .97, SRMR = .053, CFI = .98
Model 1. Configural Invariance: df = 3282, χ² = 11517.15, RMSEA = .037, TLI = .92, SRMR = .036, CFI = .92
  1 vs. 2: Δdf = 50, Δχ² = 224.59*, ΔCFI = 0
Model 2. Metric Invariance: df = 3332, χ² = 11741.74, RMSEA = .037, TLI = .92, SRMR = .038, CFI = .92
  2 vs. 3: Δdf = 50, Δχ² = 791.87*, ΔCFI = .01
Model 3. Scalar Invariance: df = 3382, χ² = 12533.61, RMSEA = .038, TLI = .91, SRMR = .038, CFI = .91
  3 vs. 4: Δdf = 56, Δχ² = 315.01*, ΔCFI = 0
Model 4. Invariant Uniquenesses: df = 3438, χ² = 12848.62, RMSEA = .038, TLI = .91, SRMR = .039, CFI = .91
  4 vs. 5: Δdf = 6, Δχ² = 171.52*, ΔCFI = 0
Model 5. Invariant Factor Variances: df = 3444, χ² = 13020.14, RMSEA = .039, TLI = .91, SRMR = .058, CFI = .91
  5 vs. 6: Δdf = 6, Δχ² = 82.66*, ΔCFI = 0
Model 6. Invariant Factor Covariances: df = 3450, χ² = 13102.80, RMSEA = .039, TLI = .91, SRMR = .057, CFI = .91
  6 vs. 7: Δdf = 6, Δχ² = 327.35*, ΔCFI = 0
Model 7. Invariant Factor Means: df = 3456, χ² = 13430.15, RMSEA = .039, TLI = .90, SRMR = .055, CFI = .91

Note. * Significant at p < .05.
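As a minimal illustration of how the nested comparisons in Table 2 are evaluated, the sketch below computes Δχ², Δdf, and the associated p-value for the Model 1 versus Model 2 comparison using SciPy; it is not the LISREL procedure the authors used, only the arithmetic behind the difference test.

    from scipy.stats import chi2

    # Values taken from Table 2.
    model1 = {"chi2": 11517.15, "df": 3282, "cfi": 0.92}  # Configural Invariance
    model2 = {"chi2": 11741.74, "df": 3332, "cfi": 0.92}  # Metric Invariance

    d_chi2 = model2["chi2"] - model1["chi2"]   # 224.59
    d_df = model2["df"] - model1["df"]         # 50
    p_value = chi2.sf(d_chi2, d_df)            # probability of this large a decrement by chance
    d_cfi = model1["cfi"] - model2["cfi"]      # change in CFI

    print(f"Delta chi2({d_df}) = {d_chi2:.2f}, p = {p_value:.3g}, Delta CFI = {d_cfi:.2f}")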

Tables 6-8 present the sequence of tests of measurement equivalence separately for each of the three performance dimensions. Finally, Table 9 presents the Δχ² and Δdf for tests of partial invariance where only two sources are constrained to be equal at a time (i.e., one source is freed at a time).

Model 0. The overall fit of this model was above the minimum fit requirements recommended by Vandenberg and Lance (2000) (RMSEA = .020; TLI = .97; SRMR = .053; CFI = .98). At this point, given the high degree of fit for the fully constrained model, Vandenberg and Lance suggest that no further model testing is required and it may be concluded that the ratings are equivalent conceptually and psychometrically across rater groups. Given the illustrative nature of the current investigation, and in particular the ability of CFA techniques to reveal various types of rater source differences, we chose to proceed with the hierarchical tests for Models 1-7, providing descriptive data and tests of partial invariance to identify the nature of potential between-group differences.

Model 1. The overall fit of the test of equivalent factor structure was above the minimum requirements suggested by Vandenberg and Lance (2000) (RMSEA = .037; TLI = .92; SRMR = .036; CFI = .92), indicating that the groups used the same number of factors and that the items loaded on the same dimension for each rater source. Thus, performance is defined similarly across rater sources. Note, of course, that the fit of all models will be good given that Model 0 fit the data well.

Table 3
Standardized Factor Loadings and Item Means

                        Self ratings        Supervisor ratings    Peer ratings
Item                    Loading   Mean      Loading   Mean        Loading   Mean
Relationships
  1+                    .70       4.31      .76       4.13        .77       4.22
  2+                    .68       3.89      .73       3.70        .73       3.85
  3+                    .67       4.02      .74       3.74        .76       3.91
  4                     .69       4.10      .77       3.92        .79       4.05
  5+                    .71       4.14      .76       3.91        .78       4.07
  6*+                   .64       4.27      .75       4.18        .76       4.21
  7                     .70       3.91      .74       3.70        .78       3.84
  8                     .69       4.06      .74       3.92        .78       4.02
  9                     .67       3.88      .76       3.71        .75       3.77
  10*+                  .61       4.40      .73       4.21        .75       4.25
  11+                   .62       4.30      .66       4.04        .72       4.16
  12                    .60       4.00      .67       3.85        .69       3.94
Customer Focus
  1+                    .78       3.86      .79       3.66        .79       3.83
  2                     .71       4.08      .74       3.92        .78       4.03
  3                     .75       3.95      .80       3.77        .77       3.94
  4                     .71       4.01      .73       3.81        .74       3.96
  5+                    .69       3.84      .75       3.71        .72       3.93
  6+                    .71       3.71      .71       3.54        .74       3.80
  7+                    .71       3.83      .74       3.72        .74       3.86
  8+                    .63       3.49      .60       3.34        .67       3.71
Continuous Learning
  1+                    .75       3.97      .80       3.71        .77       3.94
  2+                    .67       4.09      .76       3.83        .71       4.00
  3+                    .71       3.87      .76       3.58        .76       3.78
  4+                    .66       3.91      .75       3.69        .74       3.94
  5                     .65       3.92      .70       3.72        .74       3.92
  6+                    .64       3.86      .67       3.71        .72       3.90
  7+                    .62       4.15      .69       3.72        .70       3.95
  8+                    .62       3.56      .58       3.39        .66       3.63

Note. * Factor loadings are significantly different at p < .05. + Item means are significantly different at p < .05.

Model 2. The overall fit of the test of Metric Invariance (equal factor loadings) was quite high (RMSEA = .037; TLI = .92; SRMR = .038; CFI = .92), indicating good fit as well. Although the change in χ² (Δχ²(50) = 224.59, p < .05) was significant (suggesting that H0 should be rejected), the changes in the other fit indices revealed only a slight difference between the models, with the SRMR changing only by .002 and all other fit indices remaining the same.

Table 4
Scale Reliabilities and Item Variances (Variances of Measurement Error)

                        Self ratings             Supervisor ratings       Peer ratings
Item                    Scale α   Variance       Scale α   Variance       Scale α   Variance
Relationships           .90                      .93                      .94
  1                               .23                      .23                      .26
  2                               .32                      .32                      .33
  3                               .30                      .28                      .29
  4                               .24                      .24                      .24
  5                               .22                      .22                      .23
  6                               .26                      .25                      .29
  7                               .25                      .26                      .23
  8*                              .24                      .27                      .24
  9*                              .29                      .25                      .30
  10                              .24                      .23                      .25
  11*                             .26                      .34                      .31
  12*                             .28                      .25                      .27
Customer Focus          .89                      .91                      .90
  1                               .21                      .20                      .19
  2                               .24                      .21                      .22
  3                               .20                      .21                      .18
  4*                              .25                      .27                      .24
  5*                              .30                      .26                      .26
  6*                              .32                      .27                      .29
  7*                              .27                      .25                      .23
  8*                              .48                      .35                      .39
Continual Learning      .86                      .89                      .90
  1                               .26                      .26                      .26
  2                               .30                      .30                      .33
  3*                              .30                      .27                      .25
  4*                              .33                      .30                      .27
  5*                              .32                      .31                      .26
  6*                              .37                      .34                      .29
  7                               .29                      .28                      .29
  8*                              .44                      .41                      .37

Note. * Significant differences at p < .05.

To get a better feel for the differences in factor loadings, the test of Metric Invariance was conducted independently for each item. Table 3 displays the factor loadings for each item across rater sources and whether the Δχ² was significant. As can be seen, only two items had significantly different factor loadings across groups, with the self-ratings loading lower than the other two sources in both cases. Examining Model 2 separately for each dimension (Tables 6-8) does not identify one dimension as providing particularly poor fit compared to the other dimensions.

Table 5
Means, Variances, Covariances, and Intercorrelations of Latent Factors

Factor                        Mean    ξ1     ξ2     ξ3     ξ4     ξ5     ξ6     ξ7     ξ8     ξ9
Self
  ξ1 Relationships            4.31    .22    .69    .67    .15    .04    .03    .21    .11    .10
  ξ2 Customer Focus           3.86    .18    .31    .66    .03    .19    .03    .09    .21    .08
  ξ3 Continuous Learning      3.97    .18    .21    .33    .03    .03    .21    .08    .12    .24
Supervisor
  ξ4 Relationships            4.13    .04    <.01   <.01   .33    .65    .64    .23    .12    .14
  ξ5 Customer Focus           3.66    .01    .06    .01    .21    .32    .63    .14    .24    .14
  ξ6 Continuous Learning      3.71    <.01   -.01   .08    .24    .23    .42    .13    .08    .29
Peer
  ξ7 Relationships            4.22    .06    .03    .03    .08    .05    .05    .38    .74    .73
  ξ8 Customer Focus           3.83    .03    .07    .04    .04    .08    .03    .27    .35    .77
  ξ9 Continuous Learning      3.94    .03    .03    .09    .05    .05    .12    .29    .29    .41

Note. Correlations are above the diagonal, variances are on the diagonal, and covariances are below the diagonal.

Table 6
The Sequence of Invariance Tests for the Relationship Dimension

Model 0. Invariant Covariance Matrices: df = 322, χ² = 1034.73, RMSEA = .031, TLI = .97, SRMR = .073, CFI = .98
Model 1. Configural Invariance: df = 555, χ² = 3141.41, RMSEA = .049, TLI = .93, SRMR = .034, CFI = .94
  1 vs. 2: Δdf = 22, Δχ² = 68.14*, ΔCFI = 0
Model 2. Metric Invariance: df = 577, χ² = 3209.55, RMSEA = .049, TLI = .93, SRMR = .037, CFI = .94
  2 vs. 3: Δdf = 22, Δχ² = 202.51*, ΔCFI = 0
Model 3. Scalar Invariance: df = 599, χ² = 3412.06, RMSEA = .049, TLI = .93, SRMR = .037, CFI = .94
  3 vs. 4: Δdf = 24, Δχ² = 91.19*, ΔCFI = .01
Model 4. Invariant Uniquenesses: df = 623, χ² = 3503.25, RMSEA = .049, TLI = .93, SRMR = .038, CFI = .93
  4 vs. 5: Δdf = 2, Δχ² = 145.48*, ΔCFI = 0
Model 5. Invariant Factor Variances: df = 625, χ² = 3648.73, RMSEA = .050, TLI = .93, SRMR = .079, CFI = .93
  5 vs. 7: Δdf = 2, Δχ² = 141.94*, ΔCFI = 0
Model 7. Invariant Factor Means: df = 627, χ² = 3790.67, RMSEA = .051, TLI = .93, SRMR = .075, CFI = .93

Note. * Significant at p < .05.

Furthermore, Table 9 does not reveal that one particular rater source was the primary cause of the differences in factor loadings (across all items). Thus, it can be concluded that although very small differences in factor loadings do exist between sources, there are generally high levels of conceptual equivalence, and no one dimension or rater source contributed a disproportionate amount to the level of non-equivalence found for the assessed model.

Table 7
The Sequence of Invariance Tests for the Customer Focus Dimension

Model 0. Invariant Covariance Matrices: df = 150, χ² = 747.47, RMSEA = .041, TLI = .96, SRMR = .026, CFI = .98
Model 1. Configural Invariance: df = 225, χ² = 671.94, RMSEA = .031, TLI = .98, SRMR = .022, CFI = .98
  1 vs. 2: Δdf = 14, Δχ² = 67.58*, ΔCFI = 0
Model 2. Metric Invariance: df = 239, χ² = 739.52, RMSEA = .032, TLI = .98, SRMR = .029, CFI = .98
  2 vs. 3: Δdf = 14, Δχ² = 290.70*, ΔCFI = .01
Model 3. Scalar Invariance: df = 253, χ² = 1030.22, RMSEA = .039, TLI = .97, SRMR = .029, CFI = .97
  3 vs. 4: Δdf = 16, Δχ² = 119.39*, ΔCFI = 0
Model 4. Invariant Uniquenesses: df = 269, χ² = 1149.61, RMSEA = .039, TLI = .96, SRMR = .032, CFI = .97
  4 vs. 5: Δdf = 2, Δχ² = 5.45, ΔCFI = 0
Model 5. Invariant Factor Variances: df = 271, χ² = 1155.06, RMSEA = .039, TLI = .97, SRMR = .035, CFI = .97
  5 vs. 7: Δdf = 2, Δχ² = 160.57*, ΔCFI = .01
Model 7. Invariant Factor Means: df = 273, χ² = 1315.63, RMSEA = .042, TLI = .96, SRMR = .033, CFI = .96

Note. * Significant at p < .05.

Table 8
The Sequence of Invariance Tests for the Continuous Learning Dimension

Model 0. Invariant Covariance Matrices: df = 150, χ² = 875.65, RMSEA = .046, TLI = .94, SRMR = .046, CFI = .97
Model 1. Configural Invariance: df = 225, χ² = 1772.08, RMSEA = .058, TLI = .92, SRMR = .036, CFI = .94
  1 vs. 2: Δdf = 14, Δχ² = 79.38*, ΔCFI = .01
Model 2. Metric Invariance: df = 239, χ² = 1851.46, RMSEA = .057, TLI = .92, SRMR = .040, CFI = .93
  2 vs. 3: Δdf = 14, Δχ² = 294.90*, ΔCFI = .01
Model 3. Scalar Invariance: df = 253, χ² = 2146.36, RMSEA = .061, TLI = .91, SRMR = .041, CFI = .92
  3 vs. 4: Δdf = 16, Δχ² = 98.73*, ΔCFI = 0
Model 4. Invariant Uniquenesses: df = 269, χ² = 2245.09, RMSEA = .061, TLI = .92, SRMR = .043, CFI = .92
  4 vs. 5: Δdf = 2, Δχ² = 35.43*, ΔCFI = 0
Model 5. Invariant Factor Variances: df = 271, χ² = 2280.52, RMSEA = .061, TLI = .91, SRMR = .053, CFI = .92
  5 vs. 7: Δdf = 2, Δχ² = 273.73*, ΔCFI = .02
Model 7. Invariant Factor Means: df = 273, χ² = 2554.25, RMSEA = .064, TLI = .90, SRMR = .055, CFI = .90

Note. * Significant at p < .05.

Model 3. The overall fit of the model testing for Scalar Invariance (equal item means) was quite good (RMSEA = .038; TLI = .91; SRMR = .038; CFI = .91), but again there was a significant change in chi-square between Model 2 and Model 3 (Δχ²(50) = 791.87, p < .05), suggesting worse fit when the item means were constrained. In addition, the largest changes in the other fit statistics occurred for this comparison, with the CFI and TLI decreasing by .01 and the RMSEA increasing by .001. The item means, and whether they were observed to be significantly different across sources, are presented in Table 3. As indicated, the means of 19 of the 28 items are significantly different across sources.

Table 9
Tests of Partial Invariance for Each Model, with the Parameters for Two Sources Constrained to Be Equal and the Parameter for One Source Free to Vary

                                   Source ratings constrained to be equal (freed source)
                                   Self and Supervisor (Peer)   Self and Peer (Supervisor)   Supervisor and Peer (Self)
Partial Invariance Model           Δdf     Δχ²                  Δdf     Δχ²                  Δdf     Δχ²
2. Metric Invariance               25      152.47               25      97.22                25      93.16
3. Scalar Invariance               25      383.64               25      488.26               25      309.99
4. Invariant Uniquenesses          28      147.38               28      216.80               28      99.48
5. Invariant Factor Variances      3       86.95                3       153.38               3       20.69
6. Invariant Factor Covariances    3       48.46                3       12.14                3       59.26
7. Invariant Factor Means          2       232.89               2       73.16                2       219.82
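A minimal sketch of the round-robin logic behind Table 9, assuming only the tabled Δχ² and Δdf values as input (it does not refit the LISREL models); the Scalar Invariance (Model 3) row is used here.

    from scipy.stats import chi2

    # Scalar Invariance (Model 3) round-robin comparisons from Table 9:
    # each entry constrains two sources and leaves one source free.
    round_robin = {
        "Self + Supervisor (Peer free)": {"d_chi2": 383.64, "d_df": 25},
        "Self + Peer (Supervisor free)": {"d_chi2": 488.26, "d_df": 25},
        "Supervisor + Peer (Self free)": {"d_chi2": 309.99, "d_df": 25},
    }

    for label, r in round_robin.items():
        p = chi2.sf(r["d_chi2"], r["d_df"])
        print(f"{label}: delta chi2({r['d_df']}) = {r['d_chi2']}, p = {p:.2g}")

    # The pairing with the smallest delta chi2 points to the freed source as the one
    # whose parameters depart most from the other two.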

Examining Model 3 for each dimension (Tables 6-8) does not reveal that the misfit is due to one particular dimension. In addition, freeing the means for a particular source does not result in a nonsignificant Δχ² (Table 9), but does reveal that constraining the peer and supervisor ratings (and freeing the self ratings) yields the smallest increase in χ². Examination of the means reveals that self ratings are the highest for 21 items, peer ratings are the highest for 7 items, and supervisor ratings are the lowest for all items.

Model 4. The overall fit of the model testing for Invariant Uniquenesses (measurement error/item reliability) was good (RMSEA = .038; TLI = .91; SRMR = .039; CFI = .91), but the change in χ² relative to Model 3 was significant (Δχ²(56) = 315.01, p < .05). The change in the other fit statistics again was negligible, with only the SRMR increasing by .001. The item uniquenesses (variances of measurement error) and scale reliabilities are presented in Table 4; as can be seen, 14 of the item uniquenesses were significantly different between rater sources (as indicated by a significant Δχ² when the uniquenesses for each item were constrained separately). It does not appear that the majority of the misfit is due to any one dimension (see Tables 6-8). Furthermore, freeing the self ratings and constraining the supervisor and peer ratings results in the smallest increase in χ², suggesting that self-ratings may contribute more to the lack of model fit, with their uniquenesses being slightly larger. The impact of having larger item uniquenesses can be seen in the scale reliabilities, with the self-ratings having the lowest reliability across dimensions. Importantly, though, the reliabilities for all dimensions and sources are quite high, suggesting that any practical differences in reliability may be negligible.
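Because the scale reliabilities in Table 4 are coefficient alphas, a brief sketch of how such a reliability could be computed from raw item ratings may be useful; the array below is hypothetical and is not data from this study.

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha for an (n_respondents x n_items) array of ratings."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical ratings from five ratees on a four-item dimension (1-5 scale).
    ratings = np.array([
        [4, 4, 5, 4],
        [3, 3, 3, 4],
        [5, 4, 5, 5],
        [2, 3, 2, 2],
        [4, 5, 4, 4],
    ])
    print(round(cronbach_alpha(ratings), 2))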

Model 5. The overall fit of the model testing for Invariant Factor Variances was good (RMSEA = .039; TLI = .91; SRMR = .058; CFI = .91), but was significantly different from Model 4 (Δχ²(6) = 171.52, p < .05). The change in the other fit statistics was very small, with the SRMR having the largest change (an increase of .019). Analyzing this model separately for each performance dimension shows that the Relationships dimension appears to hold the lion's share of the misfit (Table 6). Specifically, the Customer Focus dimension does not reveal a significant Δχ² for this constraint (Table 7), and the Continuous Learning dimension had a decrease in model fit roughly a quarter that of the Relationships dimension (Table 8). These differences can be seen in the data in Table 5, where the variances range from .22 to .38 for the Relationships dimension, from .31 to .35 for the Customer Focus dimension, and from .33 to .42 for the Continuous Learning dimension. In addition, it can be seen that the latent variances are generally smaller for self-ratings than for the other two sources (Table 5). This difference is reflected in the relatively small Δχ² when the factor variances for supervisors and peers are constrained and those for self-ratings are freed (see Table 9). This finding is consistent with Cheung (1999), who found that the variance of self-ratings was smaller than the variance of supervisor ratings. In sum, these findings demonstrate that self-ratings have less range than the other sources, and that across sources there are larger differences in the range used for the Relationships dimension than for the other dimensions.

Model 6. The overall fit of the model testing for Invariant Factor Covariances (relationships among factors) was good (RMSEA = .039; TLI = .91; SRMR = .057; CFI = .91), but was significantly different from Model 5 (Δχ²(6) = 82.66, p < .05). The change in the other fit statistics was negligible, with the SRMR actually improving by .001 and all other statistics remaining the same. The correlations reported above the diagonal in Table 5 are estimated from the variances and covariances presented in the bottom half of the table. As indicated, the correlations between dimensions within a source are of fairly high magnitude, with the supervisor ratings exhibiting the smallest correlations (.64 on average) and the peer ratings the highest (.75 on average). A comparison of models separately by dimension was not possible because this test explicitly compares the relationships between dimensions. The tests of partial invariance across sources (Table 9) revealed that the smallest Δχ² occurred when the supervisor rating covariances were free to vary and the self and peer rating covariances were constrained to be equal. Thus, in general, supervisor ratings appear to discriminate better among the dimensions than self or peer ratings, and peer ratings tend to discriminate less than self-ratings.

Model 7. The overall fit of the model testing for Invariant Factor Means was good (RMSEA = .039, TLI = .90, SRMR = .055, CFI = .91), but was significantly different from Model 6 (Δχ²(6) = 327.35, p < .05).
The change in the other fit statistics also was generally small, with the TLI decreasing by .01 and the SRMR changing by .002. The factor means are reported in Table 5. Consistent with the item means (test for Scalar Invariance), self-ratings are the highest and supervisor ratings are the lowest for each dimension. With regard to specific individual dimension comparisons, the Customer Focus dimension had the worst level of fit (Table 7) and the other two dimensions were roughly equivalent in their levels of fit (see Tables 6 and 8). The tests of partial invariance, freeing one source at a time, revealed that constraining only the self and peer ratings resulted in the smallest decrease in model fit (Table 9). This finding is consistent with the means presented in Table 5.

Correlations Between Sources

The correlations among the latent factors are presented in the top half of Table 5 and were estimated from the variance-covariance matrix in the bottom half of the table. The average correlation between sources on the same dimension was .18 for self and supervisor ratings, .22 for self and peer ratings, and .25 for supervisor and peer ratings. In addition, the average between-source correlation was .20 for Relationships, .21 for Customer Focus, and .25 for Continuous Learning.
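The correlations in the top half of Table 5 follow directly from the variances and covariances in its bottom half; a short sketch of that conversion, using the supervisor-peer Relationships values from Table 5, is shown below.

    from math import sqrt

    # Values from Table 5: latent variances and the covariance between
    # supervisor-rated and peer-rated Relationships (factors 4 and 7).
    var_supervisor_rel = 0.33
    var_peer_rel = 0.38
    cov_supervisor_peer_rel = 0.08

    r = cov_supervisor_peer_rel / sqrt(var_supervisor_rel * var_peer_rel)
    print(round(r, 2))  # about .23, the value reported above the diagonal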

DISCUSSION

A common practice within multisource feedback systems is to compare ratings between various rater sources. However, if the feedback instrument is not equivalent, then ratings across sources or performance dimensions are not directly comparable (Drasgow & Kanfer, 1985). Surprisingly, the issue of the measurement equivalence of multisource ratings has received relatively little attention (Facteau & Craig, 2001). Based on the work of Vandenberg and Lance (2000) and Cheung (1999), this paper applies a method for comprehensively examining the measurement equivalence of self, peer, and supervisor ratings across three performance dimensions for a sample of professional, non-managerial employees. This is the first study to apply the complete series of nested CFA models recommended by Vandenberg and Lance (2000) to multisource feedback ratings of salaried, non-managerial employees. The primary contribution of this study is that it illustrates, in detail, how the specific sources of rater differences can be identified even with very complex datasets involving multiple raters and multiple performance dimensions. A second contribution of this paper is that it demonstrates that measurement equivalence is present in multisource ratings for a sample of non-managerial professionals.

The results of the current study indicate that the ratings on the three performance dimensions largely were conceptually equivalent across rater sources. These results are consistent with the existing literature on the measurement equivalence of multisource ratings (e.g., Maurer et al., 1998) and extend these findings to a non-managerial sample. As Sulsky and Keown (1998) note, the ultimate utility of multisource systems may depend on our ability to develop some consensus on the meaning of performance. These results are encouraging and suggest that comparisons made between rater sources may provide meaningful information. Further, these results suggest that the differences we observe between rater sources likely are not due to different rater sources conceptualizing performance differently, as was suggested by Campbell and Lee (1988). Future research should investigate the measurement equivalence of ratings over time, as some recent research suggests that the meaning of performance changes over time (Stennett, Johnson, Hecht, Green, Jackson, & Thomas, 1999). Finding that the meaning of performance has changed over time would have serious implications for studies investigating behavioral change longitudinally.

Recall that the first and most restrictive test revealed complete equivalence in the data. Because this restrictive test suggested equivalence, it generally is suggested that no further tests of equivalence need to be conducted (Vandenberg & Lance, 2000). However, as discussed above, this study illustrates the power of the CFA technique to detect even very small conceptual and psychometric differences between raters. Thus, in the presence of larger differences between rating sources, the techniques demonstrated here could be used to pinpoint the source of non-equivalence down to the item level.

Consistent with past research (e.g., Conway & Huffcutt, 1997), the correlations between rater sources were quite low. Specifically, the average correlation between sources was .18 for self and supervisor ratings, .22 for self and peer ratings, and .25 for supervisor and peer ratings. These results also are consistent with past research that has observed supervisor-peer correlations to be higher than self-peer or self-supervisor correlations (e.g., Harris & Schaubroeck, 1988). Importantly, the current study was able to rule out the possibility that these low correlations were due to rater sources interpreting the multisource instrument differently. The observed low correlations also support a fundamental assumption of multisource feedback systems, namely, that different rater sources represent distinct perspectives that provide unique information (London & Smither, 1995). Although supervisors are the most widely used rater source in performance management systems (Murphy & Cleveland, 1995), some have suggested that peers may be a better source of performance information (Murphy & Cleveland, 1995; Murphy, Cleveland, & Mohler, 2001).

Peers may be a better source of performance information because they may have more opportunities to observe ratee performance and likely work more closely with one another than do other rater sources (Murphy & Cleveland, 1995). Additionally, peer ratings are often perceived as better because they appear to be more reliable than supervisory ratings, likely a result of aggregating across peers (Scullen, 1997). Although the current study suggests that self, peer, and supervisor ratings are equally internally consistent, past research has found that the inter-rater reliability of supervisors is higher than that of peers when controlling for the number of raters (Greguras & Robie, 1998). Future research should continue to investigate the conditions that influence the quality of multisource ratings.

IMPLICATIONS FOR PRACTICE

The model testing framework described in this paper is recommended to practitioners as an aid in interpreting results, providing feedback to non-managerial employees, and refining the rating instrument and procedure. The most obvious benefit to practitioners of using CFA procedures is that they can identify whether the scale measures the same thing across all rater sources (i.e., conceptual equivalence). Even a carefully developed measure may not be conceptualized the same way across rater sources; a direct comparison of the factor structures, using the procedures outlined and demonstrated above, can assess whether the instrument is conceptualized equivalently across sources. If the measure is not conceptually equivalent across sources, interpreting between-source differences in means, errors, or variances could be inaccurate. Any performance improvement feedback given to employees on the basis of such information may not reflect employees' actual developmental needs, leading employees down the wrong path in terms of performance improvement and leaving important needs unaddressed.

If there is conceptual equivalence across rater sources, a second use for practitioners is to accurately identify between-source differences in ratings. For example, the procedure can identify between-source differences in rating level (e.g., supervisor ratings are the lowest), which can aid the interpretation of ratings and guide the development plan of the subordinate (e.g., more attention should be given to the relative differences of the ratings within a source, rather than to the differences between supervisor and self or peer ratings; a brief illustrative sketch follows below). Thus, this procedure can aid in the identification of specific psychometric differences between sources, which can assist the practitioner in better understanding the rating system and providing feedback to employees.
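As a rough illustration of the within-source emphasis suggested above, the sketch below standardizes dimension-level ratings within each rater source so that a ratee's relative strengths and weaknesses can be read within each source's frame of reference. The data layout, column names, and values are hypothetical and are not drawn from the study's instrument.

```python
# Minimal sketch: shift attention from between-source level differences to
# relative (within-source) differences by z-scoring ratings within source.
# All data below are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "source": ["self"] * 3 + ["peer"] * 3 + ["supervisor"] * 3,
    "dimension": ["task", "teamwork", "communication"] * 3,
    "rating": [4.2, 4.5, 4.0, 3.8, 4.1, 3.5, 3.2, 3.6, 3.0],
})

# Raw source means show the familiar level differences (e.g., supervisors lowest).
print(ratings.groupby("source")["rating"].mean())

# Standardizing within source highlights each source's relative view of the
# ratee's strengths and weaknesses across dimensions.
ratings["z_within_source"] = (
    ratings.groupby("source")["rating"]
    .transform(lambda x: (x - x.mean()) / x.std(ddof=0))
)
print(
    ratings.pivot(index="dimension", columns="source", values="z_within_source")
    .round(2)
)
```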

Another implication of this procedure for practitioners is that the results of these analyses can inform changes to the instrument, the instruction and training of raters, and the system as a whole. If problematic items are identified, they can be removed. If the performance dimensions are conceptualized differently across rater sources, how the different sources define performance must be reevaluated, and the results of that reevaluation should be reflected in future versions of the scale. Furthermore, where there are psychometric differences, rater training and instruction can be used to help eliminate these rating errors, for example by pointing out a propensity to rate high or a failure to distinguish among performance dimensions. Finally, the presence of both conceptual and psychometric differences may require an overhaul of the entire system so that a more rigorous and less error-prone procedure is used.

The broader substantive implication of these findings is that ratings across rater sources appear to be comparable for non-managerial employees; meaningful comparisons may be made across different rater sources. Multisource rating systems assume that a ratee's self-awareness is increased by reviewing self-other rating discrepancies (Tornow, 1993) and that, in turn, this increase in self-awareness facilitates ratee development and performance improvement (Church, 1997; Tornow, 1993). The results of the current study suggest that, from a measurement perspective, the ratings are equivalent and can be meaningfully compared in an attempt to increase one's self-awareness and performance.

Future Research

Although the accumulating research on the measurement equivalence of multisource ratings generally indicates that the ratings are conceptually equivalent, there are several avenues for future research. First, research should continue to explore ratee and rater characteristics that may influence the measurement equivalence of multisource ratings (Maurer et al., 1998). Second, given that rating purpose differentially impacts the dependability of ratings from different rater sources (e.g., Greguras, Robie, Schleicher, & Goff, 2003), future research should explore the impact that rating purpose may have on the conceptualization of performance, or the use of performance instruments, across different rater sources. Third, recent research has investigated the effects of multisource feedback on employee development longitudinally (e.g., Bailey & Fletcher, 2002). If studies that assess behavioral change are to be meaningfully interpreted, the measurement equivalence of multisource ratings collected longitudinally must first be established.
Fourth, future research needs to examine the extent to which multisource feedback instruments exhibit measurement equivalence across cultures. It is common practice in multinational companies to develop a feedback instrument in one country and then use it in multiple countries. Testing for measurement equivalence across countries would provide insight into whether the feedback instrument is interpreted and used similarly by individuals from different cultures.

REFERENCES
Bailey, C., & Fletcher, C. (2002). The impact of multiple source feedback on management development: Findings from a longitudinal study. Journal of Organizational Behavior, 23, 853-867.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Borman, W. C., & Brush, D. H. (1993). More progress toward a taxonomy of managerial performance requirements. Human Performance, 6, 1-21.
Campbell, D. J., & Lee, C. (1988). Self-appraisal in performance evaluation: Development versus evaluation. Academy of Management Review, 13, 302-314.
Cheung, G. W. (1999). Multifaceted conceptions of self-other ratings disagreement. Personnel Psychology, 52, 1-36.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1-27.
Church, A. H. (1997). Managerial self-awareness in high-performing individuals in organizations. Journal of Applied Psychology, 82, 281-292.
Church, A. H., & Bracken, D. W. (1997). Advancing the state of the art of 360-degree feedback. Group & Organization Management, 22, 149-161.
Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance, 10, 331-360.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace.
Dalessio, A. T., & Vasilopoulos, N. L. (2001). Multisource feedback reports: Content, formats, and levels of analysis. In D. W. Bracken, C. W. Timmreck, & A. H. Church (Eds.), The handbook of multisource feedback (pp. 181-203). San Francisco, CA: Jossey-Bass.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.
Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662-680.
Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86, 215-227.
Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360-degree feedback ratings. Journal of Applied Psychology, 83, 960-968.
Greguras, G. J., Robie, C., Schleicher, D. J., & Goff, M. (2003). A field study of the effects of rating purpose on multisource ratings. Personnel Psychology, 56, 1-21.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psychology, 41, 43-62.
Holzbach, R. L. (1978). Rater bias in performance ratings: Superior, self, and peer ratings. Journal of Applied Psychology, 63, 579-588.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Ilgen, D. R., & Feldman, J. M. (1983). Performance appraisal: A process focus. In L. Cummings & B. Staw (Eds.), Research in organizational behavior (Vol. 5). Greenwich, CT: JAI Press.
Jöreskog, K., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software.
Lance, C. E., & Bennett, W., Jr. (1997, April). Rater source differences in cognitive representation of performance information. Paper presented at the meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
London, M., & Smither, J. W. (1995). Can multisource feedback change perceptions of goal accomplishment, self-evaluations, and performance-related outcomes? Theory-based applications and directions for research. Personnel Psychology, 48, 803-839.
Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate appraisal measurement equivalence. Journal of Applied Psychology, 83, 693-702.
Mount, M. K. (1983). Comparisons of supervisory and employee satisfaction with a performance appraisal system. Personnel Psychology, 36, 99-110.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage Publications.
Murphy, K. R., Cleveland, J. N., & Mohler, C. J. (2001). Reliability, validity, and meaningfulness of multisource ratings. In D. W. Bracken, C. W. Timmreck, & A. H. Church (Eds.), The handbook of multisource feedback (pp. 275-288). San Francisco, CA: Jossey-Bass.
Pooyan, A., & Eberhardt, B. J. (1989). Correlates of performance appraisal satisfaction among supervisory and non-supervisory employees. Journal of Business Research, 19, 215-226.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322-327.
Ryan, A. M., Chan, D., Ployhart, R. E., & Slade, L. A. (1999). Employee attitude surveys in a multinational organization: Considering language and culture in assessing measurement equivalence. Personnel Psychology, 52, 37-58.
Scullen, S. E. (1997). When ratings from one source have been averaged, but ratings from another source have not: Problems and solutions. Journal of Applied Psychology, 82, 880-888.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173-180.
Stennett, R. B., Johnson, C. D., Hecht, J. E., Green, T. D., Jackson, K., & Thomas, W. (1999, August). Factorial invariance and multirater feedback. Poster presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Sulsky, L. M., & Keown, L. (1998). Performance appraisal in the changing world of work: Implications for the meaning and measurement of work performance. Canadian Psychology, 39, 52-59.
Taris, T. W., Bok, I. A., & Meijer, Z. Y. (1998). Assessing stability and change of psychometric properties of multi-item concepts across different situations: A general approach. Journal of Psychology, 132, 301-316.
Tornow, W. W. (1993). Editor's note: Introduction to special issue on 360-degree feedback. Human Resource Management, 32, 211-219.
Tucker, L. R., & Lewis, C. (1973). The reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4-69.
Whitaker, A. (1992). The transformation in work: Post-Fordism revisited. In M. Reed & H. Hughes (Eds.), Rethinking organization: New directions in organization theory and analysis. London: Sage.
Williams, J. R., & Levy, P. E. (2000). Investigating some neglected criteria: The influence of organizational level and perceived system knowledge on appraisal reactions. Journal of Business and Psychology, 14, 501-513.
Woehr, D. J., Sheehan, M. K., & Bennett, W., Jr. (1999, April). Understanding disagreement across rating sources: An assessment of the measurement equivalence of raters in 360 degree feedback systems. Poster presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Yammarino, F., & Atwater, L. (1997). Do managers see themselves as others see them? Implications of self-other rating agreement for human resource management. Organizational Dynamics, 25(4), 35-44.
