
360 RATINGS: AN ANALYSIS OF ASSUMPTIONS AND A RESEARCH AGENDA
FOR EVALUATING THEIR VALIDITY

Walter C. Borman
University of South Florida
This article argues that assumptions surrounding 360 ratings should be
examined; most notably, the assumptions that different rating sources have
relatively unique perspectives on performance and that multiple rating sources
provide incremental validity over the individual sources. Studies generally
support the first assumption, although reasons for interrater disagreement
across different organizational levels are not clear. Two research directions
are suggested for learning more about why different organizational levels
tend to disagree in their ratings and thus how to improve interpretation of
360 ratings. Regarding the second assumption, it is argued that we might resur-
rect the hypothesis that low-to-moderate across-organizational level inter-
rater agreement is actually a positive result, reflecting different levels' rat-
ers each making reasonably valid performance judgments but on partially
different aspects of job performance. Three approaches to testing this hy-
pothesis are offered.
Recently, there has been a surge of interest in 360 rating programs (e.g.,
Bracken 1996; Church & Bracken 1997; Hazucha, Hezlett, & Schneider 1993;
London & Beatty 1993; Tornow 1993). 360 programs involve generating perfor-
mance evaluations on target ratees from multiple sources such as supervisors,
peers, subordinates, and customers, as well as from the targets themselves.
Often 360 ratings are used for developmental feedback. The basic notion here
is that to improve performance, one first needs to get insight about his or her
weaknesses in job performance. Evaluations from employees who have close
working relationships with the ratee are seen as an effective way to provide
this insight. Although feedback for development is a frequent application of the
360 method, additional purposes include promotion, compensation, succes-
sion planning, and other administrative decisions.
Direct all correspondence to: Walter C. Borman, Department of Psychology, University of South Florida,
Tampa, FL 33620.
Human Resource Management Review, Volume 7, Number 3, 1997, pages 299-315.
Copyright © 1997 by JAI Press Inc. All rights of reproduction in any form reserved. ISSN: 1053-4822
Regarding how this method might be viewed in the context of industrial and
organizational psychology, Tornow (1993) provided a useful discussion. He
pointed out that the science and practice-related views of 360 evaluations may
be very different. The science-related view of this method involves interest in accu-
racy of measurement with the ratings and in reducing interrater disagreement
to minimize error in ratings. The practice-related view focuses more on the
usefulness of the method for improving individual and organizational effective-
ness. Practitioners may value the multiple perspectives on a ratee's perfor-
mance and stress the learning that can take place from the different sources'
views of his or her performance. They may also see considerable value in
simply giving all relevant parties the opportunity to express their views on
target ratees' performance. With this latter view, the means (i.e., the rating pro-
cess) may be more important than the end (i.e., the validity of the ratings).
In a discussion of the Tornow paper, Dunnette (1993) acknowledged the two
different perspectives on 360 ratings, but pointed out that having good science
supporting practice is also important. I would add that there are important
scientific questions to be asked about 360 feedback ratings that are funda-
mental in supporting the use of this technique. The most important is: Do
additional rating sources provide incremental validity beyond ratings from a
single source (e.g., supervisors)? This is especially critical if the ratings are
being used for administrative decision making. However, even in the case of
360 for developmental feedback, receiving reasonably valid performance infor-
mation (i.e., correlating highly with actual ratee performance levels) is impor-
tant to properly guide behavioral change. That is, if appropriate change is to
occur, precise and focused knowledge about the areas needing improvement
must first be provided.
Thus, there are two primary assumptions made by practitioners who use
360 ratings. One assumption is that the multiple sources of ratings (e.g.,
peers, subordinates, etc.) each offers at least somewhat unique data on ratee
performance. It follows that we would not expect or even necessarily desire
high interrater agreement across sources. However, provided that organizational
level is the unit of importance in gaining multiple perspectives on ratee perfor-
mance, interrater agreement within sources is desirable. That is, if we want
the peer perspective, the subordinate perspective, etc., regarding the perfor-
mance of ratees, we would like to obtain relatively good agreement within each
of these perspectives. To clarify this assumption, let's consider the alternatives
to the interrater agreement results just described. If we obtain very high
interrater agreement across sources, there is little need to collect ratings from
multiple sources. If interrater agreement within sources is low, we might sus-
pect that the ratings contain error, with at least some of the raters providing
invalid evaluations from that source's perspective. The second assumption is
that having additional rating sources providing evaluations yields incremental
validity over and above a single source (e.g., supervisors).
Thus, the desirable psychometric properties of 360 ratings are not typically
discussed. However, the assumptions around this method point toward the
expectation that interrater agreement across sources should be low to moder-
ate and that agreement within sources should be higher. Also, additional
sources of rating data should contribute enhanced validity for the performance
evaluations.
By far the most research has been conducted on the first topic listed above.
Regarding this question, several studies have shown that interrater agreement
within-organizational level (e.g., peer-peer or supervisor-supervisor agree-
ment) does, in fact, tend to be greater than across-level agreement (e.g., peer-
supervisor). For example, Berry, Nelson, and McNally (1966) had raters from
two supervisory levels evaluate U.S. Navy personnel separated by a two-year
interval. Correlations between performance ratings within organizational lev-
el across the two years averaged .55. Across the two levels, the correlations
were significantly lower (r = .34). Similarly, Gunderson and Nelson (1966)
analyzed peer and supervisor performance ratings of personnel in a remote
setting (the Antarctic) and found that within-organizational level interrater
reliabilities were higher than across-level reliabilities on every dimension (me-
dians .74 vs. .50). Also relevant, Borman (1974) compared a multitrait-multi-
method (MTMM) matrix using peer-supervisor correlations to estimate con-
vergent validities with a MTMM matrix employing within-organizational level
interrater reliabilities to index convergent validity. For 5 of the 7 dimensions,
within-level interrater agreement was higher than the corresponding across-
level interrater agreement (mean rs = .45 vs. .32).
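To make this within- versus across-level comparison concrete, the following illustrative sketch (in Python) simulates the pattern these studies report; the data, rater counts, and error structure are assumptions for illustration, not a reanalysis of any study cited here.

```python
# Illustrative sketch (simulated data): within-level agreement exceeds
# across-level agreement when raters who share an organizational level also
# share a level-specific perspective on ratees.
import numpy as np

rng = np.random.default_rng(0)
n_ratees = 200
true_perf = rng.normal(size=n_ratees)

# Assumed structure: each level views true performance through a shared,
# level-specific lens; individual raters add their own error on top.
peer_view = true_perf + rng.normal(scale=0.5, size=n_ratees)
sup_view = true_perf + rng.normal(scale=0.5, size=n_ratees)

def rater(view):
    return view + rng.normal(scale=0.6, size=n_ratees)

peer1, peer2 = rater(peer_view), rater(peer_view)
sup1, sup2 = rater(sup_view), rater(sup_view)

def r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

within = np.mean([r(peer1, peer2), r(sup1, sup2)])
across = np.mean([r(p, s) for p in (peer1, peer2) for s in (sup1, sup2)])
print(f"within-level agreement: {within:.2f}")
print(f"across-level agreement: {across:.2f}")
```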
The efficacy and utility of across-level ratings in general, and 360 rating
programs in particular, rely on understanding the nature of the rating differ-
ences observed across rater levels. That is, in order to interpret and appro-
priately use across-level rating information for developmental feedback or op-
erational decisions, it is important to know, for example, if raters from
different organizational levels actually observe different behaviors, or see es-
sentially the same behavior but interpret it differently or weight it differently.
Understanding the reasons for interrater disagreement across organizational
level should help to interpret multi-source ratings. We now turn to a review of
models attempting to explain reasons for interrater disagreement across orga-
nizational levels. This review is restricted to literature dealing with the peer
and supervisor rating sources, as there were sufficient studies that examined
these rating sources. This was not the case for research involving subordinate
or customer rating sources. As discussed above, a central focus in our review is
toward understanding why raters from different organizational levels disagree
in their performance ratings and how this holds meaning for the 360 ap-
proach. Accordingly, considerable attention is directed toward two basic hy-
potheses that have been offered to explain why different level raters disagree.
These are that raters at different levels (1) use different dimensions or weight
dimensions differently as a basis for evaluating performance or (2) actually
observe different performance-related behaviors in target ratees.
HYPOTHESIZED REASONS FOR PEER-SUPERVISOR
DISAGREEMENT IN RATINGS
Table 1 presents the two central hypotheses regarding why peers and super-
visors disagree in their performance ratings to the extent they do. As men-
TABLE 1
Hypotheses About Why Across-Organizational Level
Interrater Agreement is Relatively Low

1. Different Levels' Raters Use Different Performance Models
   a. They Use Different Dimensions or Define Dimensions Differently
      • When asked to generate performance dimensions, different levels produce different
        dimensions (Borman 1974; Zedeck et al. 1974)
      • Raters from different levels use different patterns of cues when making performance
        judgments (Oppler et al. in preparation)
      • Raters from different levels have different implicit performance theories (Oppler &
        Sager 1992; Pulakos & Borman 1986)
   b. They Use Similar Dimensions but Different Weights for the Dimensions in Making
      Overall Performance Judgments
      • Across-level interrater agreement on individual dimensions is high relative to agree-
        ment on overall performance dimensions (Harris & Schaubroeck 1988)
      • Regression models, with overall performance ratings regressed on dimension rat-
        ings, show significant differences for different organizational level raters (Borman et
        al. 1995; Pulakos et al. 1996; Tsui 1984; Tsui & Ohlott 1988)
2. Different Levels' Raters Use Different Samplings of Ratee Behavior in Making Perfor-
   mance Judgments
   a. They Have Different Opportunities to Observe Ratees, and Thus See Different Ratee
      Behavior
      • Different levels have different opportunities to see ratees in various roles (Tsui 1984)
      • Peer raters have more opportunity to view ratees than do supervisors, and therefore
        self-peer interrater agreement (same organizational level) should be higher than self-
        supervisor or peer-supervisor interrater agreement (different levels) (Harris & Schau-
        broeck 1988)
tioned, the first hypothesis is that different levels' raters use different perfor-
mance models when making ratings. Beneath the first hypothesis in the table
are two different versions of the hypothesis, along with essentially operational
definitions of each version. The second hypothesis is that different levels' rat-
ers use different samplings of ratee behavior in making performance judg-
ments. Table 2 summarizes results for studies examining these hypotheses. It
should be mentioned that the different versions of the hypotheses and opera-
tional definitions are not completely independent from each other. This be-
comes clear when we see that some of the empirical studies to be reviewed
apply to more than one version. Nonetheless, we believe this summary of the
various explanations for across-organizational level interrater disagreement
provides a useful framework for assessing the empirical research bearing on
this issue and for helping to interpret 360 ratings.
Hypothesis 1, Version 1. This hypothesis is that different level raters use
different dimensions or that they somehow define dimensions differently
when making ratings. The most direct test of this hypothesis has been to
have individuals from different organizational levels simply generate the
important performance dimensions from their own perspective and compare
these dimension sets across level. Borman (1974) took this approach with
secretaries in a university setting and their instructor bosses. As part of a
behaviorally anchored rating scale development process, each group inde-
pendently generated what they believed were the important performance
dimensions, and the results demonstrated that the dimension sets showed
only modest conceptual similarity (p. 105). No quantitative comparison of
the two dimension sets was offered, however. Similarly, Zedeck, Imparato,
Krausz, and Oleno (1974) had two groups, peers and supervisors of hospital
nurses, develop performance appraisal dimensions independent of each oth-
er. Subjective comparisons suggested considerable overlap in this case, and
a consensus meeting between the groups resulted in agreement on a single
dimension set (Zedeck et al. 1974).
Another way to address the different-dimensions version of Hypothesis 1 is
to evaluate whether or not raters from different organizational levels use dif-
ferent patterns of factors or cues when making performance judgments. The
reasoning is that if different levels attend to different cues, they may have at
least partially different dimension sets influencing their judgments about rat-
ee performance. Relevant to this version of Hypothesis 1, in a Project A sample
(a large-scale selection and classification study conducted in the U.S. Army
[Campbell 1990]), Oppler, Pulakos, and Borman (in preparation) developed a
path model with the cue (or exogenous) variables being ratee ability, technical
proficiency (from work sample tests), job knowledge, personality (two vari-
ables), and objective measures of commendable and disciplinary problem be-
havior. The endogenous variable in the model was the performance ratings.
Results showed very similar patterns of path coefficients when a peer rating
model was compared to a supervisory rating model, with ratee problem behav-
ior, dependability, and achievement orientation demonstrating the strongest
effects on both peer and supervisor ratings.
A study by Borman, White, and Dorsey (1995) came to essentially the same
conclusion. In another Project A sample, ratee technical proficiency and de-
pendability had the largest path coefficients to the overall job performance
ratings made by supervisors and peers. However, Pulakos, Schmitt, and Chan
(1996) studied supervisor and peer ratings of professionals in a large govern-
ment agency and found that significantly different path coefficients were evi-
dent for the paths between cue variables and overall performance ratings for
the two sources. Especially large differences occurred for the written technical
proficiency-rating and role-play proficiency-rating paths.
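As a concrete illustration of this cue-based approach, the sketch below regresses simulated peer and supervisor ratings on a common set of ratee cue variables and compares the standardized weight patterns for the two sources; the cue set, data, and weights are invented for illustration and are not the Project A or Pulakos et al. analyses.

```python
# Illustrative sketch (simulated data): do peer and supervisor ratings show
# similar patterns of weights on the same ratee cue variables?
import numpy as np

rng = np.random.default_rng(1)
n = 300
# Assumed cues: ability, technical proficiency, job knowledge, dependability.
X = rng.normal(size=(n, 4))
# Both sources are simulated to weight proficiency and dependability heavily.
peer_rating = X @ np.array([0.1, 0.4, 0.1, 0.4]) + rng.normal(scale=0.5, size=n)
sup_rating = X @ np.array([0.1, 0.5, 0.1, 0.3]) + rng.normal(scale=0.5, size=n)

def std_betas(X, y):
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

print("peer betas:      ", np.round(std_betas(X, peer_rating), 2))
print("supervisor betas:", np.round(std_betas(X, sup_rating), 2))
# Similar beta patterns across sources would resemble the Oppler et al. and
# Borman et al. findings; clearly different patterns would resemble the
# Pulakos et al. (1996) result.
```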
Still another approach to test the different-dimension version of Hypothesis
1, albeit quite indirect, comes from the Harris and Schaubroeck (1988) meta-
analysis. They reasoned that if there is low interrater reliability across organi-
zational levels, both for the individual dimension and overall performance rat-
ings, then the different levels' raters may be using different dimensions. They
argued that moderate levels of interrater agreement for dimension and overall
performance ratings across the peer and supervisor levels (corrected r = .62)
provided some evidence that similar dimensions were being employed by the
two levels' raters.
TABLE 2
Summary of Studies Examining Reasons for Across-Organizational Level
Interrater Disagreement in Ratings

Berry et al. (1966) (1st and 2nd line supervisors)
   Hypothesis: Use different dimensions
   Test: Within-level vs. across-level IR
   Results and conclusions: Within-level IR higher than across-level IR

Zedeck and Baker (1972) (1st and 2nd line supervisors)
   Hypothesis: See different samples of behavior and value behavior differently
   Test: Convergent and discriminant validity of ratings
   Results and conclusions: Moderate convergent and low discriminant validity

Borman (1974) (peer-supervisor)
   Hypothesis: Use different dimensions
   Test: Independent development of peer and supervisor dimensions; IR compared within-
      and across-level and on own and others' dimensions
   Results and conclusions: Different dimensions generated; IR higher within-level and on
      own dimensions

Klimoski and London (1974) (self-peer-supervisor)
   Hypothesis: Use different dimensions
   Test: Hierarchical factor analysis of all ratings
   Results and conclusions: Halo and three rating source factors emerge, along with two
      content factors

Zedeck et al. (1974) (peer-supervisor)
   Hypothesis: Use different dimensions and value behavior differently
   Test: Independent development of dimensions and rating of behavioral examples
   Results and conclusions: Similar dimensions generated; peers gave higher effectiveness
      values to examples

Tsui (1984) (peer-supervisor-subordinate-self)
   Hypotheses: 1. Use different weights for dimensions; 2. See different samples of behavior
   Test: 1. Overall performance ratings regressed on role effectiveness ratings for each
      rating source; 2. Comparisons of ratings of observed behavior on six ratee roles
      across rating sources
   Results and conclusions: 1. No significant differences in peer, supervisor, and
      subordinate regression models; 2. Differences between sources were small

Pulakos and Borman (1986) (peer-supervisor)
   Hypothesis: Use different implicit performance theories
   Test: Compared factor structures of peer and supervisor ratings
   Results and conclusions: Factor structures very similar

Harris and Schaubroeck (1988) (meta-analysis: self-supervisor, self-peer, and
      peer-supervisor ratings)
   Hypotheses: 1. Use different weights for dimensions; 2. Use different dimensions;
      3. See different samples of behavior
   Test: 1. Is peer-supervisor IR high on dimension ratings but low on overall performance
      ratings? 2. Is peer-supervisor IR low for both dimension and overall performance
      ratings? 3. Is self-peer (same organizational level) IR higher than self-supervisor
      (different organizational levels) IR?
   Results and conclusions: 1. IR same for dimension and overall performance ratings;
      2. IR moderate for dimension and overall performance ratings; 3. IR same for
      self-peer and self-supervisor

Tsui and Ohlott (1988) (peer-supervisor-subordinate)
   Hypotheses: 1. Different types and amounts of rater errors across sources; 2. See
      different samples of behavior; 3. Use different weights for dimensions
   Test: 1. Not tested; 2. Not tested; 3. Policy-capturing weights and subjective weights
      assigned compared across organizational levels
   Results and conclusions: 1. N/A; 2. N/A; 3. Small policy-capturing differences, but
      differences in subjective weights across levels

Borman et al. (1995) (peer-supervisor)
   Hypothesis: Define performance differently
   Test: Compared peer and supervisor rating models with several characteristics of
      ratees as exogenous variables and ratings as the endogenous variable
   Results and conclusions: Peer and supervisor models similar

Pulakos et al. (1996) (peer-supervisor)
   Hypothesis: Define performance differently
   Test: Compared peer and supervisor rating models with several characteristics of
      ratees as exogenous variables and ratings as the endogenous variable
   Results and conclusions: Peer and supervisor models had significant differences

Oppler et al. (in preparation) (peer-supervisor)
   Hypothesis: Define performance differently
   Test: Compared peer and supervisor rating models with several characteristics of
      ratees as exogenous variables and ratings as the endogenous variable
   Results and conclusions: Peer and supervisor models very similar

Note: IR = interrater reliability.

Finally, comparisons of what might be called implicit performance theories
held by supervisors versus peers also seem relevant to this hypothesis. First,
in work on Project A (Campbell 1990), Pulakos and Borman (1986) factor an-
alyzed separately supervisor and peer ratings of more than 800 first tour
soldiers on 10 performance dimensions and found remarkably similar 3-factor
solutions for each set of ratings, with the patterns of loadings almost identi-
cal.
Second, Oppler and Sager (1992) analyzed performance ratings of more than
35,000 army trainees, each by two peers and one drill instructor supervisor. A
4-factor model derived from the Peer 1 rating group (one of the two peers
randomly selected for each ratee) was then tested for fit with the Peer 2 ratings
and with the supervisor ratings using confirmatory factor analysis. Results
showed a significantly better fit with the other peer ratings compared to the
supervisor ratings, suggesting that the two organizational levels may use at
least somewhat different implicit performance theories.
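One simple way to quantify how "similar" two factor structures are is a factor congruence (Tucker's phi) comparison of the loading matrices; the sketch below is a hypothetical illustration with invented loadings, not the original Pulakos and Borman or Oppler and Sager analyses.

```python
# Illustrative sketch (invented loadings): comparing peer and supervisor
# factor-loading matrices with Tucker's congruence coefficient.
import numpy as np

def congruence(a, b):
    """Tucker's phi between two columns of factor loadings."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(2)
# Assumed layout: 10 rating dimensions (rows) by 3 factors (columns).
peer_loadings = rng.uniform(0, 1, size=(10, 3))
sup_loadings = peer_loadings + rng.normal(scale=0.05, size=(10, 3))  # nearly identical

phis = [congruence(peer_loadings[:, k], sup_loadings[:, k]) for k in range(3)]
print("factor congruence coefficients:", np.round(phis, 3))
# Values near 1.0 would mirror the "remarkably similar" peer and supervisor
# solutions described above; markedly lower values would suggest different
# implicit performance theories.
```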
Overall, however, the available evidence does not strongly support the no-
tion that peer and supervisor raters use significantly different dimensions or
employ different implicit performance theories when rating job performance.
The Zedeck et al., Oppler et al., Borman et al., and Pulakos and Borman
studies, each using a different analysis approach, provide relatively strong
evidence against this hypothesis. On the other hand, the Oppler and
Sager, and Pulakos et al. studies suggest some support for a different-models
explanation of rater disagreement. It should be noted, however, that the Oppler and
Sager findings might have resulted from the very high power of their design for
detecting differences across sources.
Hypothesis 1, Version 2. The second version of Hypothesis 1 is that raters
from different organizational levels use similar dimensions in making per-
formance judgments, but that they weight these dimensions differently. Har-
ris and Schaubroeck (1988) tested this version by comparing the levels of
peer-supervisor interrater agreement on individual dimensions and on
overall performance dimensions. The reasoning was that across-level inter-
rater agreement should be adversely affected only with the overall perfor-
mance dimension because the two levels essentially form different summary
composite performance judgments due to the different patterns of dimen-
sion weights employed by peers and supervisors. Their meta-analysis re-
sults showed no differences in interrater agreement for the two kinds of
dimensions, evidence counter to this explanation. Similarly, results from
several other studies using peer and supervisor ratings (e.g., Borman &
Pulakos 1986; Borman, Mendel, Lammlein, & Rosse 1981; Borman, To-
quam, & Rosse 1978) demonstrate that overall performance dimensions
have higher across-level interrater reliabilities than do individual dimen-
sions, again providing evidence against the different-weights version
of Hypothesis 1.
In another study, Tsui and Ohlott (1988) gathered performance ratings from
the supervisors, peers, and subordinates of more than 200 manager ratees.
Raters also made evaluations of ratees on six managerial roles (e.g., leader,
liaison), on need for power and achievement, and on interpersonal competence.
Multiple regression results for the supervisor, peer, and subordinate rater
groups were quite similar. In each of the three models, interpersonal compe-
tence and the managerial roles accounted for the most variance in the overall
performance ratings. Thus, the patterns of weights were quite similar for the
different organizational levels, casting doubt on the validity of the different
weights hypothesis.
Similarly, Tsui (1984) gathered peer, supervisor, subordinate, and self rat-
ings on the same six reputational roles used by Tsui and Ohlott, and on overall
performance. When the overall ratings were regressed on the role effectiveness
ratings, no significant differences were found among peer, supervisor, and
subordinate regression models, providing more support for rejection of the
different-weights explanation of across-organizational level interrater dis-
agreement.
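A minimal sketch of the kind of test these regression comparisons imply is given below: fit a pooled model that forces all sources to use the same dimension weights, fit a second model that allows source-specific intercepts and weights, and compare the two with an F test. The data, dimension counts, and weights are assumptions for illustration only.

```python
# Illustrative sketch (simulated data): do rating sources weight dimension
# ratings differently when forming overall performance judgments?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_source, n_dims, n_sources = 200, 3, 3
D_list, y_list, g_list = [], [], []
for g in range(n_sources):
    D = rng.normal(size=(n_per_source, n_dims))            # dimension ratings
    y = D @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.5, size=n_per_source)
    D_list.append(D); y_list.append(y); g_list.append(np.full(n_per_source, g))
D, y, g = np.vstack(D_list), np.concatenate(y_list), np.concatenate(g_list)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum()), X.shape[1]

ones = np.ones((len(y), 1))
restricted = np.hstack([ones, D])                           # common weights for all sources
dummies = np.column_stack([(g == k).astype(float) for k in range(1, n_sources)])
full = np.hstack([ones, D, dummies] +
                 [D * dummies[:, [k]] for k in range(n_sources - 1)])

rss_r, p_r = rss(restricted, y)
rss_f, p_f = rss(full, y)
df1, df2 = p_f - p_r, len(y) - p_f
F = ((rss_r - rss_f) / df1) / (rss_f / df2)
print(f"F({df1}, {df2}) = {F:.2f}, p = {stats.f.sf(F, df1, df2):.3f}")
# A nonsignificant F is consistent with the similar weight patterns reported
# across organizational levels in these studies.
```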
Finally, it might be argued that the Oppler et al. (in preparation) and Bor-
man et al. (1995) studies are relevant to the different-weights version as well as
the different-dimension version of Hypothesis 1. If we consider the exogenous
non-rating variables (such as ratee personality, job knowledge, and proficiency)
as dimensions of performance, the findings that the peer and supervisor re-
gression and path models of performance ratings were very similar could be
interpreted as supporting the conclusion that the weights different levels rat-
ers place on dimensions are similar. Of course, the Pulakos et al. (1996) study
is not as supportive of this conclusion.
Hypothesis 2. This hypothesis is that raters from different organizational
levels tend to disagree in their ratings because they use different samplings
of ratee behavior when making performance judgments. They use different
behavior samples presumably because they have different opportunities to
view ratee behavior. That is, only a subset of the behavior relevant to effec-
tive performance is observable from a particular organizational level. This
hypothesis has not been as well tested as the first one, with only a couple of
studies at all relevant.
First, Tsui (1984) asked her supervisor, peer, and subordinate raters to
indicate how much ratee behavior they observed relative to each of the six
reputational roles. Results of a MANOVA, with the six mean behavior-ob-
served ratings as the dependent variables, were significant but the effect size
was quite small. At the same time, across-organizational level interrater agreement
was low, with the average superior-subordinate correlation about .15, the sub-
ordinate-peer correlations less than .20, and the average peer-supervisor cor-
relation about .35. Tsui concluded that what she called the information-differ-
ences hypothesis is plausible but that it does not appear to be an important
cause of low interrater agreement across organizational levels.
Harris and Schaubroeck's (1988) meta-analysis provided an indirect test of
the different-samples hypothesis. They reasoned that peers and supervisors
disagree in their ratings because peers see more ratee behavior than super-
visors. They asserted that if this is true, self and peer ratings should correlate
more highly than either self and supervisor or supervisor and peer ratings. As
mentioned, the peer and supervisor ratings correlated much higher than did
the other two combinations, a result negative to the different-samples hypothe-
sis. Harris and Schaubroeck admit, however, that more direct tests of this
hypothesis are needed. It might be pointed out that Hypothesis 2 could be true
if either the amount of information each organizational level observes relative
to each performance dimension is different or the content of the information
related to each dimension is different. It should also be noted that a less
substantive, more psychometric explanation for interrater disagreement is
that raters from different organizational levels might make certain rater er-
rors to a different degree (e.g., leniency, restriction-in-range).
Three final studies offered hypotheses about why raters from different orga-
nizational levels disagree substantially in their performance ratings, but re-
sults of the studies appear relevant to either of our main hypotheses. Berry et
al. (1966) invoked essentially the different-dimensions explanation for their
findings showing greater within-organizational level interrater agreement
than across-level agreement. Zedeck and Baker (1972) found only moderate
convergent validity in performance ratings across first and second line super-
visor raters, concluding that the two levels probably observed different sam-
ples of ratee behavior (Hypothesis 2) and valued at least some work behavior
differently, the latter close to the different-weights version of Hypothesis 1.
And, Klimoski and London (1974) conducted a hierarchical factor analysis of
peer, supervisor, and self ratings, identifying three rating source factors, along
with a general factor and two content factors cutting across the sources. The
emergence of strong rating source factors led Klimoski and London to conclude
that each organizational level's raters may use different dimensions in making
performance judgments.
In sum, evidence for Hypothesis 1, Version 1, the different-dimensions expla-
nation, is mixed, but overall not highly supportive of this explanation of across-
organizational level interrater disagreement. Further, the explanation for in-
terrater disagreement of raters at different levels using different weights for a
similar set of dimensions (Version 2) is not well supported. Regarding Hypoth-
esis 2, very little evidence is available to support or refute this different-
samples explanation for interrater disagreement. Thus, neither of the primary
explanations for across-organizational level interrater disagreement has received
strong support. However, it is our position that more research is needed to
investigate further reasons for interrater disagreement across organizational
level. Two possible research directions are discussed next.
RESEARCH ON REASONS FOR INTERRATER DISAGREEMENT
Application of Personal Construct Theory Techniques. Personal Construct
Theory (Adams-Webber 1979; Kelly 1955) could be used to study directly the
dimensions different organizational level raters use naturally in making per-
formance judgments. Eliciting dimensions from peers and supervisors, for ex-
ample, and then comparing the content of these dimensions, both within- and
across-organizational level, would more directly evaluate this aspect of Hy-
pothesis 1 than has been accomplished to date.
An example of how such a study might be carried out was provided by
Borman (1987). He had Army officers focus on the job of NCO (i.e., first-line
supervisor). Each officer independently listed several effective and ineffective
NCOs he or she had worked with, and then used Kelly's Repertory Grid Tech-
nique to record the performance constructs he/she believed differentiated be-
tween the effective and ineffective NCOs. After identifying and defining 6-10
constructs apiece (e.g., Firmness-Ability to control personnel and situations
without falling apart), officers rated the similarity of each of their constructs to
each of 49 reference constructs representing ability, personal characteristics,
and performance domains (e.g., Good with Numbers, Energy Level, and Train-
ing Soldiers). The vector of similarity ratings for each performance construct
then reflected a numerical definition of the construct, and these numerical
definitions could be compared (correlated) across officers.
The purpose of the Borman (1987) research was to explore individual differ-
ences between officers in the constructs generated, but clearly this methodol-
ogy could be used to investigate similarities and differences in the dimensions
used across organizational levels as well. A specific hypothesis relevant to the
different-dimensions hypothesis is that correlations between constructs within
organizational level should be significantly higher than between-construct cor-
relations across organizational levels.
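A sketch of how the resulting construct profiles might be compared follows. The reference-construct count mirrors Borman (1987), but the raters, constructs, and similarity data are simulated, and the shared level-specific component is an assumption built in to show what support for the hypothesis would look like.

```python
# Illustrative sketch (simulated data): compare repertory-grid construct
# profiles within and across organizational levels. Each construct is a vector
# of similarity ratings to a common set of reference constructs.
import itertools
import numpy as np

rng = np.random.default_rng(4)
n_refs = 49  # reference constructs, as in Borman (1987)

def constructs(level_component, n_raters=5, per_rater=8):
    # Assumption: raters at the same level share a common component in how
    # their constructs relate to the reference set.
    return [level_component + rng.normal(scale=1.0, size=n_refs)
            for _ in range(n_raters * per_rater)]

peer_constructs = constructs(rng.normal(size=n_refs))
sup_constructs = constructs(rng.normal(size=n_refs))

def mean_r(pairs):
    return float(np.mean([np.corrcoef(a, b)[0, 1] for a, b in pairs]))

within = mean_r(list(itertools.combinations(peer_constructs, 2)) +
                list(itertools.combinations(sup_constructs, 2)))
across = mean_r(list(itertools.product(peer_constructs, sup_constructs)))
print(f"mean within-level construct correlation: {within:.2f}")
print(f"mean across-level construct correlation: {across:.2f}")
# The different-dimensions hypothesis predicts within > across.
```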
Expanding Harris and Schaubroeck's Moderator Analysis. A second sugges-
tion is to expand the Harris and Schaubroeck meta-analysis to include two sets
of moderators, each representing one of the hypotheses discussed in this paper.
For example, the Hypothesis 1, different-models moderators would index for
each study in the meta-analysis something like how different the roles, orien-
tations, and perspectives of the peer and supervisor organizational levels to-
ward the target ratees are in that organization. This could be accomplished by
defining each moderator on a rating scale (possibly multiple scales) and obtain-
ing ratings on the scales from persons knowledgeable about each organization.
The moderator analysis would then proceed to test the hypothesis that differ-
ent roles (however defined to best reflect this aspect of Hypothesis 1) affect
the correlations between peer and supervisor ratings.
An analogous approach could be followed with the different-samples hypoth-
esis. A rating scale (or scales) would be developed to obtain estimates for each
organization for which a peer-supervisor interrater agreement estimate is
available of how similar the sampling of ratee behavior is likely to be across the
two rater levels. A moderator analysis could then be conducted to assess the
effects of the Hypothesis 2 variable on correlations between peer and super-
visor ratings. At this point, a comparison should be made to determine the
relative strength of these moderator variables to explain why the peer and
supervisor rating sources disagree in the ratings to the extent they do.
It might also be illuminating to see what the correlation is between the
different moderator variables. This would be important in interpreting results
of the moderator analyses. For example, there may be substantial correlation
between similarity in role, orientation, and perspective and similarity in the
samplings of behavior observed. Those organizations with peers and super-
visors working closely together (similar samplings of behavior) may tend to
have similar roles, orientations, and perspectives toward members of the ratee
(peer) level. However, the main issue to be addressed with the additional mod-
erator analyses is which of the two hypotheses better explains differences in
levels of peer-supervisor interrater agreement.
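The sketch below illustrates what such a moderator analysis might look like with hypothetical study-level data: each study's peer-supervisor correlation is regressed on a rated role-similarity moderator (Hypothesis 1) and a rated behavior-overlap moderator (Hypothesis 2), weighting studies by sample size. Every value shown is invented.

```python
# Illustrative sketch (hypothetical study-level data): weighted moderator
# regression of peer-supervisor correlations on two rated moderators.
import numpy as np

# Columns: peer-supervisor r, role-similarity rating, behavior-overlap rating, N
studies = np.array([
    [0.30, 2.0, 2.5,  80],
    [0.45, 3.5, 3.0, 150],
    [0.55, 4.0, 4.5, 200],
    [0.25, 1.5, 2.0,  60],
    [0.60, 4.5, 4.0, 120],
])
r_ps, role_sim, beh_sim, n = studies.T

X = np.column_stack([np.ones_like(r_ps), role_sim, beh_sim])
W = np.diag(n)                                   # weight studies by sample size
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ r_ps)
print("intercept, role-similarity slope, behavior-overlap slope:",
      np.round(beta, 3))
print("moderator intercorrelation:",
      round(float(np.corrcoef(role_sim, beh_sim)[0, 1]), 2))
# Comparing the two slopes, while noting how correlated the moderators are,
# speaks to which hypothesis better accounts for across-level disagreement.
```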
The above research directions may be especially helpful for addressing
theoretical and practical issues involved in understanding the reasons for, and
conditions under which, we can expect across-level interrater agreement.
personal construct research should provide a more direct test of the different-
dimensions hypothesis and the moderator study would help us understand
better the factors contributing to across level interrater disagreement. How-
ever, we believe it will be most illuminating to focus research on the second
assumption of 360 ratings discussed previously, that ratings from more than
one organizational level enhance the validity of the performance information.
RESEARCH TO EVALUATE THE INCREMENTAL VALIDITY OF RATINGS
FROM MULTIPLE ORGANIZATIONAL LEVELS
Plausible conceptual arguments have been made that low interrater agree-
ment across organizational levels may not necessarily signal a lack of validity
for ratings from these sources (Borman 1974; Campbell, Dunnette, Lawler, &
Weick 1970). In fact, moderate across-level agreement in ratings may indicate
that each level's raters are focusing on performance in somewhat different
performance domains but providing quite valid ratings for the domains being
evaluated. In the context of our earlier discussion regarding the two hypothe-
ses, this outcome might be a function of either using different dimensions or
models or reporting on different samplings of ratee behavior.
Accordingly, I recommend more research to follow up on the conception that
ratings from more than one organizational level may increase the validity of
rating information, when, in fact, there is disagreement across rating levels.
Unfortunately, the research strategies to examine this issue will normally
require longitudinal designs, placing considerable restrictions on the data ap-
propriate for advancement in this area. Nonetheless, we now describe three
possible strategies for testing the central assumption in 360 programs that
ratings from more than one organizational level may be more valid than rat-
ings from a single level.
Correlating Performance Ratings Concurrently with Relevant External Criteria.
First, in the perhaps unusual cases where there are external criteria against
which to evaluate the validity of ratings from multiple perspectives, a concur-
rent validation strategy may be possible. In such cases, peer and supervisor
ratings can each be correlated separately with the criterion and then pooled to
assess the incremental validity of the combined ratings over ratings from each
source.
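A sketch of this concurrent check follows. The data are simulated under the hypothesis advanced here, that each source validly covers a partially different slice of performance, so the structure and every value below are assumptions.

```python
# Illustrative sketch (simulated data): incremental validity of pooled peer and
# supervisor ratings against an external criterion.
import numpy as np

rng = np.random.default_rng(5)
n = 250
# Assume overall performance has two components and each source mainly sees one.
A, B = rng.normal(size=n), rng.normal(size=n)
criterion = A + B + rng.normal(scale=0.5, size=n)   # external criterion
peer = A + rng.normal(scale=0.7, size=n)            # peers observe mostly component A
sup = B + rng.normal(scale=0.7, size=n)             # supervisors observe mostly component B

def r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def z(x):
    return (x - x.mean()) / x.std()

composite = z(peer) + z(sup)       # standardized within source, then pooled
print(f"peer-supervisor agreement: {r(peer, sup):.2f}")
print(f"peer validity:             {r(peer, criterion):.2f}")
print(f"supervisor validity:       {r(sup, criterion):.2f}")
print(f"pooled validity:           {r(composite, criterion):.2f}")
# Low across-level agreement together with higher pooled validity is exactly
# the pattern the incremental-validity hypothesis predicts.
```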
Two articles seldom referred to in the performance rating literature provide
some empirical evidence and additional theoretical rationale for a model that
argues for evaluations from two or more different perspectives yielding more
valid judgment than evaluations from only a single source.
Buckner's (1959) study is an example of this approach. He had two different
levels of supervisors evaluate the performance of U.S. Navy submariner en-
listed men. Buckner found that in cases where interrater agreement across the
two organizational levels was relatively low, a composite of the ratings corre-
lated higher with scores on objective criteria (argued to be good indicators of
actual performance) than in cases where interrater agreement was higher.
A second study did not employ multiple organizational level raters, but still
provides an example of how evaluators from different perspectives might con-
tribute ratings that when taken together are more valid than any single eval-
uator. Einhorn (1972) had four medical experts with somewhat different spe-
cific areas of expertise predict the survival time of persons who had contracted
Hodgkin's disease. Each made ratings on several components related to the
severity of the disease, as well as an overall prediction of survival time. Multi-
ple regression results using both sets of ratings showed that the validity
against actual survival time was greatest when all four judges data were used,
even allowing for the lower reliability of individual judges' ratings. These
findings seemed to be in part a function of different judges providing valid
information on different components. This is analogous to raters at different
organizational levels each contributing valid performance information, but on
somewhat different aspects of the job. It would be desirable to see more studies
like Buckner's and Einhorn's in organizational settings with raters from multi-
ple organizational levels and independently derived indexes of overall perfor-
mance.
Predicting Subsequent Job Performance from Ratings in Training. A second
possible setting for the proposed research direction is in industrial or military
training environments. In cases where peer and instructor evaluations of po-
tential, likely subsequent job performance, or some similar prediction of future
performance are available, correlating these ratings with later overall job per-
formance should address the same hypothesis. In this setting we would predict
that the peer and instructor ratings, taken together (i.e., standardized within
source and then summed), will provide higher validity against subsequent
performance than either rating source by itself.
Past research relevant to this hypothesis has focused on the usefulness of
peer ratings in training to predict important criteria in military environments.
Kane and Lawler (1978) reviewed all of the pre-1978 research in this area. For
example, Williams and Leavitt (1947) found that peer ratings of leadership
taken in Marine Corps Officer Candidate School (OCS) were better predictors
of success in OCS and combat performance than either instructor ratings in
training or objective test scores. The predictive validity of peer and supervisor
ratings taken together was not examined.
Similarly, Hollander (1954) gathered leadership peer nominations and in-
structor ratings of cadets in U.S. Navy pre-flight training and showed that the
peer nominations correlated higher (r = .27) with graduation from flight train-
ing 14 months later than did the instructor ratings (r = .18). In this case, it was
possible to compute a multiple R from data contained in the article, and the
results do not support our hypothesis (R = .27). However, it is not clear how
much contact relevant to this criterion the instructors had with students in
this setting. Also, there is some question for our purposes about the similarity
of the leadership rating construct and the construct(s) reflected in the training
pass-fail criterion.
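For readers who wish to reproduce this kind of calculation, the multiple correlation of two predictors with a criterion follows directly from the two validities and the predictor intercorrelation. In the sketch below the two validities are those cited above from Hollander (1954), but the peer-instructor intercorrelation is an assumed value for illustration; it is not reported here.

```python
# Sketch: multiple R for two predictors from their validities and their
# intercorrelation. R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2).
from math import sqrt

r1, r2 = 0.27, 0.18   # peer and instructor validities cited from Hollander (1954)
r12 = 0.40            # assumed peer-instructor correlation (illustrative only)

R = sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))
print(f"multiple R = {R:.2f}")
# With these inputs R barely exceeds .27, consistent with the point that the
# combined ratings added little predictive value in that data set.
```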
Predicting Subsequent Performance Levels with Ratings of Potential Taken on
the Job. Finally, perhaps the most relevant set of data for testing, within a
predictive validity framework, the hypothesis that pooled peer-supervisor rat-
ings will be more valid than ratings from either source alone is one in which
peer and supervisor predictor ratings are based on job performance and these
rating dimensions represent constructs conceptually matched with the subse-
quently gathered criteria.
One type of setting where such data may be available is when ratings of
potential or promotability are made by peer and supervisor raters. Of course,
such evaluations are usually generated by supervisors only, but if an organiza-
tion did collect potential or promotability ratings from peers as well, it would
be possible to correlate the ratings against later progress in the organization or
performance at higher organizational levels. The hypothesis would be the
same: the pooled ratings across organizational levels will be more valid than
ratings from either source individually. Similar to assessment center valida-
tion research to predict future performance, it will be important in such a
study that criterion contamination can be ruled out as an explanation for the
results.
CONCLUSIONS
I have argued here that the assumptions underlying the use of multiple source
ratings in 360 programs should be examined and evaluated empirically. One
of the assumptions is that ratings from different organizational levels provide
different, relatively unique perspectives. Research supporting this assumption
is that interrater agreement within organizational level has generally been
found to be higher than across level. We reviewed research on two hypotheses
to explain this result, the different performance models and different samples
of behavior hypotheses. Results do not strongly support either hypothesis. This
is unfortunate because learning why members of different organizational lev-
els tend to disagree in their ratings would help us to interpret 360 ratings.
Two research approaches to making progress in this area were offered.
On balance, practitioners of 360 programs should not be alarmed by low to
moderate agreement across rating sources. In my judgment, when possible,
within-level interrater agreement should be evaluated (e.g., for peers, subordi-
nates, or customers). High within-source agreement suggests that the rating
source is providing a coherent, valid (from their perspective) view of perfor-
mance. Practitioners might also keep in mind that within-source agreement
can be increased by obtaining larger numbers of raters.
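The Spearman-Brown prophecy formula shows how quickly within-source agreement rises as raters are added; the single-rater reliability used below is an illustrative assumption.

```python
# Sketch: reliability of a source's mean rating as a function of the number of
# raters (Spearman-Brown), assuming parallel raters.
def spearman_brown(single_rater_r, k):
    return k * single_rater_r / (1 + (k - 1) * single_rater_r)

for k in (1, 2, 4, 8):
    print(f"{k} raters: reliability of mean rating = {spearman_brown(0.40, k):.2f}")
```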
A second assumption in 360 programs is that each organizational level's
ratings provide incremental validity to performance assessment. Regarding
this assumption, I argued that it is time to resurrect the hypothesis that
moderate or even low interrater agreement across organizational levels may re-
flect different levels' raters making reasonably valid performance judgments
but on partially different aspects of performance (Borman 1974; Campbell et
al. 1970). It was noted that this hypothesis implies nothing about the hypothe-
ses related to reasons for across-organizational level interrater disagreement.
Rendering potentially valid performance judgments on partially different ele-
ments of the job could be a function of different-level raters using different
performance models, their seeing different samples of behavior, or some
combination of the two. Nonetheless, three strategies were discussed for testing
the hypothesis that multiple source ratings taken together may be more valid
than ratings from a single source alone.
If disagreement across rater levels exists and the incremental validity hy-
pothesis is confirmed, support is gained for employing 360 assessment pro-
grams. Certainly, results demonstrating incremental validity for adding rat-
ings from additional rating levels and sources (e.g., subordinates, customers)
would provide much needed empirical support for the administrative and fi-
nancial costs associated with these types of programs. As a final point, it
should be noted that although the research reviewed and evaluated in this
article focused on peer and supervisor rating sources, the same conceptual
arguments and research directions are appropriate for studying other poten-
tial rating sources as well (e.g., self, subordinates, customers).
ACKNOWLEDGMENT
The author wishes to thank Anne Tsui, Michael Harris, Chris Sager, Mary Ann
Hanson, and Elaine Pulakos for their comments on an earlier version of this
article.
REFERENCES
Adams-Webber, J. R. 1979. Personal Construct Theory: Concepts and Applications. New
York: John Wiley and Sons.
Berry, N. H., P. D. Nelson, and M. S. McNally. 1966. A Note on Supervisory Ratings.
Personnel Psychology 19: 423-426.
Borman, W. C. 1974. The Rating of Individuals in Organizations: An Alternate Ap-
proach. Organizational Behavior and Human Performance 12: 105-124.
———. 1987. Personal Constructs, Performance Schemata, and Folk Theories of Sub-
ordinate Effectiveness: Explorations in an Army Officer Sample. Organizational
Behavior and Human Decision Processes 40: 307-322.
Borman, W. C., R. M. Mendel, S. E. Lammlein, and R. L. Rosse. 1981. Development and
Validation of a Test Battery to Predict the Job Performance of Transmission and
Distribution Personnel at Florida Power and Light (Institutes Report #70). Min-
neapolis: Personnel Decisions Research Institutes.
Borman, W. C. and E. D. Pulakos. 1986. Army-Wide Data Analyses and Results. In
Development and Field Test Report for the Army-Wide Rating Scales and the
Rater Orientation and Training Program. Submitted to the U. S. Army Research
Institute for the Behavioral and Social Sciences, Alexandria, VA.
Borman, W. C., J. L. Toquam, and R. L. Rosse. 1978. Development and Validation of an
Inventory Battery to Predict Navy and Marine Corps Recruiter Performance (In-
stitute Report #22). Minneapolis: Personnel Decisions Research Institutes.
Borman, W. C., L. A. White, and D. W. Dorsey. 1995. Effects of Ratee Task Performance
and Interpersonal Factors on Supervisor and Peer Performance Ratings. Jour-
nal of Applied Psychology 80: 168-177.
Bracken, D. W. 1996. Multisource (360-degree) Feedback: Surveys for Individual and
Organizational Development. Pp. 117-147 in Organizational Surveys, edited by
A. I. Kraut. San Francisco: Jossey-Bass.
Buckner, D. N. 1959. The Predictability of Ratings as a Function of Interrater Agree-
ment. Journal of Applied Psychology 43: 60-64.
Campbell, J. P. 1990. An Overview of the Army Selection and Classification Project
(Project A). Personnel Psychology 43: 231-239.
Campbell, J. P., M. D. Dunnette, E. E. Lawler, and K. E. Weick. 1970. Managerial
Behavior, Performance, and Effectiveness. New York: McGraw Hill.
Church, A. H. and D. W. Bracken. 1997. Advancing the State of the Art of 360 Feed-
back: Guest Editors' Comments on the Research and Practice of Multirater As-
sessment Methods. Group & Organization Management 22: 149-161.
Dunnette, M. D. 1993. My Hammer or Your Hammer? Human Resource Management
32: 373-384.
Einhorn, H. J. 1972. Expert Measurement and Mechanical Combination. Organiza-
tional Behavior and Human Performance 7: 86-106.
Gunderson, E. K. E. and P. D. Nelson. 1966. Criterion Measures for Extremely Isolated
Groups. Personnel Psychology 19: 67-82.
Harris, M. M. and J. Schaubroeck. 1988. A Meta-Analysis of Self-Supervisor, Self-Peer,
and Peer-Supervisor Ratings. Personnel Psychology 41: 43-62.
Hazucha, J. F., S. A. Hezlett, and R. J. Schneider. 1993. The Impact of 360 Feedback on
Management Skills Development. Human Resource Management 32: 325-351.
Hollander, E. P. 1954. Peer Nominations on Leadership as a Predictor of the Pass-Fail
Criterion in Naval Air Training. Journal of Applied Psychology 38: 150-153.
Kane, J. S. and E. E. Lawler. 1978. Methods of Peer Assessment. Psychological Bulle-
tin 85: 555-586.
Kelly, G. A. 1955. The Psychology of Personal Constructs. New York: Norton.
Klimoski, R. J. and M. London. 1974. Role of the Rater in Performance Appraisal.
Journal of Applied Psychology 59: 445-451.
London, M. and R. W. Beatty. 1993. 360 Feedback as a Competitive Advantage.
Human Resource Management 32: 353-372.
Oppler, S. H. and C. E. Sager. 1992. Relationships among Rated Dimensions of Training
Performance: Assessing Consistency across Organizational Levels. Paper pre-
sented at the 100th Annual Convention of the American Psychological Associa-
tion, Washington, DC.
Oppler, S. H., E. D. Pulakos, and W. C. Borman. (forthcoming). Comparing the Influ-
ences of Ratee Characteristics on Peer and Supervisor Ratings.
Pulakos, E. D. and W. C. Borman. 1986. Development and Field Test of Army-Wide
Rating Scales and the Rater Orientation and Training Program (Tech. Rep. No.
716). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social
Sciences.
Pulakos, E. D., N. Schmitt, and D. Chan. 1996. Models of Job Performance Ratings: An
Examination of Ratee Race, Ratee Gender, and Rater Level Effects. Human
Performance 9: 103-119.
Tornow, W. W. 1993. Perceptions or Reality: Is Multi-Perspective Measurement a
Means or an End? Human Resource Management 32: 221-229.
Tsui, A. S. 1984. A Role Set Analysis of Managerial Reputation. Organizational Be-
havior and Human Performance 34: 64-96.
Tsui, A. S. and P. Ohlott. 1988. Multiple Assessment of Managerial Effectiveness:
Interrater Agreement and Consensus in Effectiveness Models. Personnel Psy-
chology 41: 779-803.
Williams, S. B. and H. J. Leavitt. 1947. Group Opinion as a Predictor of Military
Leadership. Journal of Consulting Psychology 11: 283-291.
Zedeck, S. and H. T. Baker. 1972. Nursing Performance as Measured by Behavioral
Expectation Scales: A Multitrait-Multirater Analysis. Organizational Behavior
and Human Performance 7: 457-466.
Zedeck, S., N. Imparato, M. Krausz, and T. Oleno. 1974. Development of Behaviorally
Anchored Rating Scales as a Function of Organizational Level. Journal of Ap-
plied Psychology 59: 249-252.
