
Journal of Educational Psychology 1982, Vol. 74, No. 2, 264-279.

Copyright 1982 by the American Psychological Association, Inc. 0022-0663/82/7402-0264$00.75

Validity of Students' Evaluations of College Teaching: A Multitrait-Multimethod Analysis


Herbert W. Marsh
University of Sydney, Australia

College instructors in 329 classes evaluated their own teaching effectiveness with the same 35-item rating form that was used by their students. There was student-instructor agreement in courses taught by teaching assistants (r = .46), undergraduate courses taught by faculty (r = .41), and even graduate level courses (r = .39). Separate factor analyses of the student and instructor ratings demonstrated that the same nine evaluation factors underlay both sets of ratings. A multitrait-multimethod analysis provided support for both convergent and divergent validity of these rating factors. Not only were correlations between student and instructor ratings on the same factors statistically significant for each of nine factors (median r = .45), but correlations between their ratings on different factors were low (median r = .02). These findings demonstrate student-instructor agreement on evaluations of teaching effectiveness, support the validity of student ratings for both graduate and undergraduate courses, and emphasize the importance of using multifactor rating scales that are derived through the application of factor analysis.

Students' evaluations of teaching effectiveness are often criticized as lacking validity. However, the ratings are difficult to validate because there is no universal criterion of good teaching. Researchers, using a construct validation approach, have sought to demonstrate that student ratings are related to a variety of other measures assumed to be indicative of effective teaching. If two indicators of the same quality (student ratings and some other measure of teaching effectiveness) show agreement, then there is support for the validity of both.

The most commonly employed criterion has been performance on a final examination in a multisection course. When different sections of the same course were taught by different instructors, the sections that did best on the standardized examination given to all sections were also the ones that evaluated their instructors more favorably (Centra, 1977; Frey, 1973, 1978; Frey, Leonard, & Beatty, 1975; Marsh, Fleiner, & Thomas, 1975; Marsh & Overall, 1980). Using a similar procedure, Marsh and Overall (1980) demonstrated that sections that evaluated teaching more favorably were also the ones that felt better able to apply the course materials and were more inclined to pursue the subject further. Other researchers have reported good agreement between student ratings and both the retrospective ratings of recent alumni (Centra, 1973; Marsh, 1977) and the follow-up evaluations of the same students several years after graduation (Overall & Marsh, 1980).

Validity research such as that described above, although supporting the use of student ratings, has generally been limited to a specialized setting (e.g., large multisection courses) or has employed criteria (e.g., student reports) that are unlikely to convince sceptics. Thus, faculty will continue to question the usefulness of student ratings until validity criteria that are both convincing and applicable across a wide range of courses are utilized. One criterion that meets both these requirements is instructors' evaluations of their own teaching effectiveness.

This study was also the basis of a paper presented at the 1979 Annual Meeting of the American Educational Research Association. Grateful acknowledgments are extended to John Schutz, Joseph Kertes, and Robert Linnell, administrators at the University of Southern California who have supported the student evaluation program. Thanks are also extended to J. U. Overall and Xoe Cosgrove, who made many helpful suggestions on earlier drafts of this paper. Requests for reprints should be sent to Herbert W. Marsh, Department of Education, University of Sydney, Sydney, New South Wales 2006, Australia.

This criterion should be acceptable to most faculty as one indicator of effective teaching, and it can be applied in all instructional settings. Furthermore, instructors can be asked to evaluate their own teaching along the same dimensions employed in the student rating form, thereby testing the specific validity of the different rating factors.

In spite of the apparent appeal of instructor self-evaluations as a criterion for assessing student ratings, they have had limited application. Centra (1973) found correlations of about .20 between faculty self-evaluations and student ratings, but both sets of ratings were collected at the middle of the term as part of a larger project that examined the impact of feedback from student ratings. Blackburn and Clark (1975) also reported correlations of about .20, but they only asked faculty to rate their teaching in a general sense rather than to rate their teaching in a specific class that was also evaluated by students. In contrast, higher correlations have been demonstrated in four other studies. Webb and Nolan (1955) reported a correlation of .62 between student ratings and 51 instructor self-evaluations at a naval training school. Doyle and Crichton (1978) found a median correlation of .47 between student ratings and self-ratings of teaching assistants in 10 sections of a multisection course. Braskamp, Caulley, and Costin (1979) reported median correlations of .31 and .65 when student ratings were correlated with the self-evaluations of 17 teaching assistants for two successive semesters. Marsh, Overall, and Kesler (1979) asked faculty in 83 undergraduate classes to evaluate their own teaching with the same 27-item rating instrument completed by their students. Mean differences between student and faculty responses were small, factor analysis demonstrated that the same evaluation factors underlay both sets of ratings, and the median correlation between student and faculty ratings was .49.

The Marsh, Overall, and Kesler (1979) study served as a basis for the present research. This study, though a replication of the earlier one, differs from it in three important respects. First, several new evaluation factors were added to the survey instrument. Second, the study was expanded to include courses taught by teaching assistants and graduate level courses as well as undergraduate courses taught by faculty. The validity of student ratings can be explored across all the courses and separately within each of the three sets. Third, the sample size was increased to include a total of 329 courses. This permits detailed application of multitrait-multimethod analyses that test for both convergent and divergent validity. Convergent validity, which is typically considered, is the correlation between student and instructor ratings on the same evaluation dimensions. Divergent validity refers to the distinctiveness of the different evaluation factors. The demonstration of divergent validity would provide powerful evidence against the common practice of using a single overall summary item or a simple average across a broad set of evaluation items.

Method
During the academic year 1977-1978, student evaluations were collected in virtually all courses offered in the Division of Social Sciences at the University of Southern California. Evaluations were administered shortly before the end of the term, generally by a designated student in the class or by a staff person. Students were told that the evaluations would provide feedback to instructors and would be considered as part of personnel decisions. The surveys were completed by an average of 76% of the students enrolled in each class.

The evaluation instrument consisted of 35 evaluation items adapted from Hildebrand, Wilson, and Dienst (1971), Marsh, Overall, and Thomas (1976), and Marsh et al. (1979). The median reliability of individual evaluation items (intraclass correlation coefficients based on sets of responses from 25 students per class) was .88. A factor analysis of the student ratings of all undergraduate courses taught by regular faculty revealed nine separate evaluation factors. The reliability of the factors, coefficient alphas, varied from .88 to .97.

Instructor self-evaluation surveys were sent to all teachers who had been evaluated by students in at least two different courses during the same term. Instructors were asked to evaluate the effectiveness of their own teaching in both courses. These surveys were completed after the end of the term but before summaries of the student evaluations were returned. Although participation was voluntary, a cover letter from the dean of the division strongly encouraged cooperation and guaranteed confidentiality of each teacher's response. Instructors evaluated both courses with a set of items identical to those used by students, except that items were worded in the first person. They were specifically asked to rate their own teaching effectiveness as they perceived it (even if they felt that their students would disagree) and not to report how their students would rate them. Instructors were also asked to respond to attitudinal items about student ratings, potential biases in these ratings, and other possible procedures for evaluating teaching effectiveness. A total of 181 (78%) surveys were returned. Since most faculty evaluated the effectiveness of their teaching in two different classes, self-ratings were completed in a total of 329 different classes: 183 undergraduate courses taught by faculty, 45 graduate level courses, and 101 courses taught by teaching assistants (teaching assistants did not teach any graduate level courses).

Eleven evaluation scores (factor scores representing the nine evaluation factors and overall ratings of the teacher and the course) were used to summarize the student ratings and instructor self-evaluations. Evaluation factor scores were weighted averages of standardized responses to each evaluation item. The weights, factor score coefficients, were derived from a previous factor analysis of a larger sample of undergraduate courses (Marsh & Overall, 1979). The evaluation scores and a brief description are as follows:
(a) Learning/Value = the extent to which students felt they encountered a valuable learning experience that was intellectually challenging.
(b) Instructor Enthusiasm = the extent to which students perceived the instructor to display enthusiasm, energy, humor, and an ability to hold their interest.
(c) Organization = the instructor's organization of the course, course materials, and class presentations.
(d) Group Interaction = students' perceptions of the degree to which the instructor encouraged class discussions and invited students to share their own ideas or to be critical of those presented by the instructor.
(e) Individual Rapport = the extent to which students perceived the instructor to be friendly, interested in students, and accessible in or out of class.
(f) Breadth of Coverage = the extent to which students perceived the instructor to present alternative approaches to the subject and to emphasize analytic ability and conceptual understanding.
(g) Examinations = students' perceptions of the value and fairness of graded materials in the course.
(h) Assignments = the value of class assignments (readings, homework, etc.) in adding to appreciation and understanding of the subject.
(i) Workload/Difficulty = students' perceptions of the relative difficulty, workload, pace of presentations, and number of hours required by the course.
(j) Overall Course = a single item asking students to compare the course with other courses.
(k) Overall Instructor = a single item asking students to compare the instructor with other instructors.
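The factor scoring described above (standardized item responses combined through factor score coefficients, with coefficient alpha as the reliability index) can be sketched as follows. The data, the weights, and the variable names are invented for illustration; this is not the SPSS routine or the actual set of coefficients used in the study.

```python
import numpy as np

# Hypothetical class-level data: rows are classes, columns are the 35 evaluation items.
rng = np.random.default_rng(0)
item_means = rng.normal(4.0, 0.4, size=(329, 35))      # invented class-mean ratings

# Standardize each item across classes (z scores).
z = (item_means - item_means.mean(axis=0)) / item_means.std(axis=0)

# Invented factor score coefficients for one factor (e.g., Learning/Value):
# nonzero weights on its five items, zero elsewhere.
weights = np.zeros(35)
weights[:5] = [0.30, 0.25, 0.20, 0.15, 0.10]

learning_value_score = z @ weights                      # one factor score per class

def coefficient_alpha(items):
    """Cronbach's alpha for a classes-by-items matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# With real rating data the five Learning/Value items would yield a high alpha
# (the paper reports .88-.97 across factors); random data will give a value near zero.
print(coefficient_alpha(item_means[:, :5]))
```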

Results

Faculty Attitudes Toward Student Ratings

As part of the study, faculty were asked to express their agreement or disagreement with statements concerning student ratings, potential biases to student ratings, and other possible methods of evaluating the quality of their teaching (see Marsh & Overall, 1979, for a more detailed description). A majority (59%) of the faculty indicated that some measure of teaching quality should be given more emphasis in promotional decisions. Faculty clearly agreed (80%) that student ratings are useful to the faculty themselves as feedback, and a majority (52%) even agreed that the ratings should be publicly available for students to use in course selections. However, only 38% felt that student ratings really represent an accurate assessment of instructional quality, and they were even more critical about using classroom visitation by peers or faculty self-evaluations in promotional decisions. Faculty were also asked to indicate the items in a list of 17 "potential biases" they believed would actually have a substantial impact on student ratings. The most commonly mentioned were course difficulty (72%), grading leniency (68%), instructor popularity (63%), student interest in subject before course (62%), course workload (55%), class size/enrollment (55%), and required versus elective (55%).

A dilemma clearly exists. Faculty are concerned about teaching effectiveness, even to the extent of wanting it to play a more important role in their own promotions. However, many expressed doubts about each of the measures of teaching effectiveness that were suggested, including student ratings. In particular, faculty suggested a number of sources of potential bias in the student ratings. Before the potential usefulness of student ratings can be realized, faculty and administrators have to be convinced that student ratings are valid and relatively free of bias.

Factor Analysis

Separate factor analyses were performed on student and instructor self-ratings of the 35 evaluation items (see Table 1) to determine whether the evaluation factors underlying student evaluations were similar to those representing instructor self-evaluations. Both confirm the nine evaluation factors that have previously been identified. Each item, for both student and instructor ratings, loads highest on the factor it is designed to measure. Loadings for items defining each factor are generally at least .40 and usually exceed .50. All other loadings are less than .30 and are usually less than .20. The similarity in the two factor patterns implies that similar dimensions underlie both student and instructor ratings of effective teaching.

The results of both factor analyses are also quite similar to results of a previous factor analysis performed on a larger sample of student ratings in just undergraduate courses taught by faculty (Marsh & Overall, 1979). Several analytical techniques are available for the comparison of different factor analyses, but none have been thoroughly developed (Levine, 1977). Among other procedures, Levine suggests correlating the factor loadings. In the present application, each factor pattern (see Table 1) has 315 factor loadings; each of 35 items has loadings on each of the nine factors. Factor loadings for the factor analysis of instructor self-ratings correlate .90 with both the loadings from the factor analysis of student ratings in this study and the previous analysis of student ratings in just undergraduate courses taught by faculty; loadings from the two factor analyses of student ratings correlate .95 with each other. These results also confirm the similarity of the factor patterns resulting from student and instructor ratings. In summary, these analyses demonstrate the replicability of the student rating dimensions and their generality across different methods of evaluation.

Convergent and Divergent Validity: Campbell-Fiske Analysis

Campbell and Fiske (1959) advocate the assessment of validity by determining measures of more than one trait, each of which is assessed by more than one method. In the present application, the multiple traits are the nine evaluation factors, and the multiple methods are the two distinct groups of raters: students rating their instructors and the instructors rating themselves. Convergent validity, the most typically determined, is the correlation between the same evaluation factors rated by two different groups. Discriminant validity refers to the distinctiveness of each of the evaluation factors.

Campbell and Fiske (1959) argue that the importance of considering multiple traits as separate entities can be tested by determining their discriminant validity. Two different aspects of discriminant validity are particularly relevant to the present application. The first examines whether student-instructor agreement on each factor is independent of agreement on other factors. For example, if a single "generalized rating factor" underlies both student and instructor ratings, then agreement on any particular factor might be a function of agreement on the generalized factor and not have anything to do with the specific content of the factor being considered. As a consequence, correlations between student and instructor ratings on the same factor would be high, but so would the correlations between their ratings on different factors. The second aspect of discriminant validity considers the possibility that the relationship between different factors as rated by the same group of raters is due to the method of data collection rather than to true relationships between the underlying dimensions being considered. The most likely source of this method variance in the present application would be a halo effect.

Convergent and discriminant validity across all courses was determined by examining the correlation matrices in Table 2. The correlations between different evaluation factors as assessed by the same group of raters are contained in the two triangular matrices: intercorrelations among instructor self-evaluation factors (upper left) and student evaluation factors (lower right). The diagonals of these triangular matrices contain the reliabilities of the factors, the coefficient alphas, for each group of raters. The square matrix (lower left) contains the correlations between student evaluation factors and instructor self-evaluation factors. The diagonal of the square matrix (the convergent validity coefficients) contains correlations between the same evaluation factors as assessed by the two different groups. Since there is unreliability in both the student ratings (median reliability = .94) and the instructor self-evaluations (median reliability = .82), the convergent validity coefficients have been corrected for unreliability (see Footnote 1).
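The correction referred to here is the familiar correction for attenuation (the note to Table 2 calls it the Spearman-Brown equation); the form below and the worked number are supplied for illustration and are not quoted from the article:

$$r_{\text{corrected}} = \frac{r_{SI}}{\sqrt{r_{SS}\,r_{II}}},$$

where $r_{SI}$ is the observed correlation between a student rating factor and the matching instructor self-evaluation factor, and $r_{SS}$ and $r_{II}$ are the two reliabilities. Using the median reliabilities quoted above, an observed correlation of .37 would be corrected to roughly $.37/\sqrt{.94 \times .82} \approx .42$; in the study the correction was presumably applied factor by factor with each factor's own reliabilities.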

Table 1
Factor Analyses of Students' Evaluations of Teaching Effectiveness and the Corresponding Faculty Self-Evaluations of Their Own Teaching in All 329 Courses

Evaluation items (paraphrased), grouped by the factor each is designed to measure:
1. Learning/Value: Course challenging/stimulating; Learned something valuable; Increased subject interest; Learned/understood subject matter; Overall course rating.
2. Enthusiasm: Enthusiastic about teaching; Dynamic & energetic; Enhanced presentations with humor; Teaching style held your interest; Overall instructor rating.
3. Organization: Instructor explanations clear; Course materials prepared & clear; Objectives stated & pursued; Lectures facilitated note taking.
4. Group Interaction: Encouraged class discussions; Students shared ideas/knowledge; Encouraged questions & answers; Encouraged expression of ideas.
5. Individual Rapport: Friendly toward students; Welcomed seeking help/advice; Interested in individual students; Accessible to individual students.
6. Breadth of Coverage: Contrasted implications; Gave background of ideas/concepts; Gave different points of view; Discussed current developments.
7. Examinations/Grading: Examination feedback valuable; Eval. methods fair/appropriate; Tested emphasized course content.
8. Assignments: Readings/texts valuable; Added to course understanding.
9. Workload/Difficulty: Course difficulty (easy-hard); Course workload (light-heavy); Course pace (too slow-too fast); Hours/week outside of class.

[The full matrix of factor pattern loadings for student ratings (Std) and instructor self-evaluations (Ins) is not legible in this copy.]

Note. Factor loadings in boxes are the loadings for items designed to measure each factor. All loadings are presented without decimal points. Factor analyses of student ratings (Std) and instructor self-evaluations (Ins) consisted of a principal-components analysis, Kaiser normalization, and rotation to a direct oblimin criterion. The first nine unrotated factors for the instructor self-ratings had eigenvalues of 9.5, 2.9, 2.5, 2.2, 2.0, 1.4, 1.3, 1.1, & 1.0 and accounted for 68% of the variance. For the student ratings the first nine eigenvalues were 19.9, 3.3, 2.3, 1.5, 1.2, .9, .7, .6, & .5 and accounted for 88% of the variance. The analyses were performed with the commercially available SPSS routine (see Nie et al., 1975).


The set of matrices in Table 2, referred to as a multitrait-multimethod matrix, is based on the combined data of all three sets of classes: those taught at the graduate and the undergraduate levels by regular faculty and those taught by teaching assistants. Multitrait-multimethod matrices were also constructed separately for each of the three sets of classes.

Campbell and Fiske (1959) have proposed four criteria for assessing convergent and divergent validity:
1. The convergent validity coefficients (student-instructor agreement on the same rating factors) should be statistically significant and sufficiently different from zero to warrant further examination of validity. Failure of this test indicates that the different methods are measuring different constructs and implies a lack of validity in at least one of the methods (i.e., student or instructor ratings).
2. The convergent validities should be higher than the correlations between different traits (the different rating factors) assessed by different methods. The failure of this test implies that agreement on a particular trait is not independent of agreement on other traits, perhaps suggesting that the agreement can be explained in terms of a generalized agreement that encompasses more than one (or all) of the traits.
3. The convergent validities should be higher than correlations between different traits assessed by the same method. If the convergent validities are not substantially higher, there is the suggestion that the traits may be correlated, that there is a method effect, or some combination of these possibilities. If the correlations between different traits assessed by the same method approach the reliabilities of the traits, then there is evidence of a strong halo or method bias.
4. The pattern of correlations between different traits should be similar for each of the different methods. Satisfaction of this criterion, assuming that there are significant correlations among traits, would suggest that the underlying traits are truly correlated. Failure to meet this criterion implies that the observed correlation between traits assessed by a given method is due to a method or halo bias.
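To make the bookkeeping behind these comparisons concrete, the sketch below applies the first three criteria to a generic two-method multitrait-multimethod correlation matrix. The layout, the function name, and the toy numbers are illustrative assumptions, not the correlations from Table 2; with nine traits, criterion 2 compares each validity coefficient with 16 others, the 144 comparisons described in the text.

```python
import numpy as np

def campbell_fiske_checks(R, n_traits):
    """Apply the first three Campbell-Fiske criteria to a two-method
    multitrait-multimethod correlation matrix.

    R is ordered method by method: rows/columns 0..n_traits-1 are the instructor
    self-evaluation factors, n_traits..2*n_traits-1 are the student rating factors.
    """
    hetero = R[n_traits:, :n_traits]          # student (rows) x instructor (cols) block
    validity = np.diag(hetero)                # convergent validity coefficients

    # Criterion 1: validities clearly above zero (significance testing omitted here).
    crit1 = validity > 0

    # Criterion 2: each validity exceeds every other value in its row and column
    # of the heteromethod block.
    crit2 = np.array([
        validity[t] > max(np.delete(hetero[t, :], t).max(),
                          np.delete(hetero[:, t], t).max())
        for t in range(n_traits)
    ])

    # Criterion 3: each validity exceeds the correlations between that trait and
    # every other trait within the same method (both monomethod triangles).
    mono_i = R[:n_traits, :n_traits]
    mono_s = R[n_traits:, n_traits:]
    crit3 = np.array([
        validity[t] > max(np.delete(mono_i[t, :], t).max(),
                          np.delete(mono_s[t, :], t).max())
        for t in range(n_traits)
    ])
    return validity, crit1, crit2, crit3

# Invented three-trait example, purely to show the bookkeeping.
R = np.array([
    [1.00, 0.30, 0.20, 0.45, 0.10, 0.05],
    [0.30, 1.00, 0.25, 0.12, 0.50, 0.08],
    [0.20, 0.25, 1.00, 0.06, 0.09, 0.40],
    [0.45, 0.12, 0.06, 1.00, 0.35, 0.30],
    [0.10, 0.50, 0.09, 0.35, 1.00, 0.28],
    [0.05, 0.08, 0.40, 0.30, 0.28, 1.00],
])
print(campbell_fiske_checks(R, 3))
```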

The first Campbell-Fiske criterion, a test of convergent validity, requires that the diagonal values of the square matrix be substantially higher than zero. Inspection of Table 2 indicates that this was the case for all the evaluation factors. Convergent validity coefficients varied between .17 and .69 (median r = .45), and all were statistically significant. Convergent validity coefficients were also determined separately for each of the three sets of courses. The median coefficients were .41 (faculty-taught undergraduate courses), .39 (graduate level courses), and .46 (undergraduate courses taught by teaching assistants). Only four of these 27 validity coefficients failed to reach statistical significance: three of the nine coefficients for graduate level courses and one of the nine for courses taught by teaching assistants.

The second Campbell-Fiske criterion requires that each convergent validity coefficient be higher than any other correlation in the same row or column of the square matrix. This test requires that each of the nine convergent validity coefficients be compared with each of 16 other coefficients, a total of 144 comparisons in all (see Footnote 1). Across all courses (see Table 2) this criterion was satisfied for 143 of the 144 comparisons, providing good support for this aspect of divergent validity. With few exceptions, this criterion was also met when data from each of the three sets of courses were considered separately. Examinations/Grading was the only factor that did not consistently pass this test; it satisfied this criterion for only the graduate level courses.

The third criterion requires that each convergent validity coefficient be higher than correlations between that factor and any other factor evaluated by the same group of raters.
Footnote 1. For purposes of testing the second and third guidelines proposed by Campbell and Fiske, no correlations (not even the convergent validity coefficients) were corrected for unreliability. In an unreported analysis, these two guidelines were applied to the multitrait-multimethod matrix (Table 2) in which all correlations had been corrected for unreliability. In this case there was somewhat stronger support for divergent validity and somewhat less indication of a method/halo bias.

Table 2
Multitrait-Multimethod Matrix: Correlations Between Student and Faculty Self-Evaluations in All 329 Courses

The rows and columns are, in order: instructor self-evaluation factors 1. Learning/Value, 2. Enthusiasm, 3. Organization, 4. Group Interaction, 5. Individual Rapport, 6. Breadth, 7. Examinations, 8. Assignments, and 9. Workload/Difficulty, followed by student evaluation factors 10-18 (the same nine factors in the same order).

[The individual correlation coefficients are not legible in this copy.]

Note. All correlation coefficients are presented without decimal points. Correlations greater than .10 are statistically significant. Values in the diagonals of the upper left and lower right matrices, the two triangular matrices, are reliability (coefficient alpha) coefficients (see Nie et al., 1975). Values in the diagonal of the lower left matrix, the square matrix, are convergent validity coefficients that have been corrected for unreliability according to the Spearman-Brown equation. The nine uncorrected validity coefficients, starting with Learning, would be .41, .48, .25, .46, .25, .37, .13, .36, & .54.


For example, the convergent validity coefficient for Enthusiasm was (a) higher than correlations between instructor ratings of Enthusiasm and any other instructor rating factor and (b) higher than correlations between student ratings of Enthusiasm and any other student rating factor. For the instructor self-evaluations, application of this criterion yielded only 4 rejections (out of 72 comparisons). Each of these failures involved the Examinations factor. For student ratings, however, there were 30 rejections (also out of 72 comparisons). The majority of these failures involved the Learning, Organization, and Examinations factors. The separate application of this criterion to each of the separate groups of classes revealed similar findings: there were relatively few rejections for instructor self-evaluations, whereas nearly half the comparisons involving student ratings resulted in rejections. Although alternative explanations do exist, these findings suggest that there is some method or halo effect in the student ratings.

The fourth criterion requires that the pattern of correlations among the student rating factors be similar to the pattern among the instructor self-evaluation factors. A visual inspection of Table 2 suggests that this is the case. To provide a more precise test, the 36 off-diagonal coefficients in the instructor self-evaluation triangle were correlated with those in the student rating triangle. The result (r = .43) was significant at the .01 level and indicates that there is a similarity in the pattern of correlations. The satisfaction of this criterion suggests that at least part of the supposed "method or halo effect" actually represents true relationships among the factors that are independent of method.

The interpretation of a method or halo effect in the student ratings (suggested by the application of the third criterion) must be tempered by two considerations. The first involves the reliability of the data. The application of the third criterion implicitly assumes that both student and instructor ratings are equally reliable and that there is comparable attenuation due to unreliability in correlations among student ratings, in correlations among instructor self-ratings, and in correlations between student and instructor ratings. However, the student ratings are more reliable (median coefficient alpha = .94) than are the instructor self-evaluations (median coefficient alpha = .82). Consequently, correlations among student evaluations are systematically inflated (i.e., less attenuated) relative to the convergent validity coefficients, and particularly relative to correlations among instructor self-evaluations. The higher correlations among the student rating factors that were interpreted as a method/halo effect were at least partially caused by the higher reliability of the student ratings.

The second complication in the interpretation of a method effect involves the possibility of true relationships among the rating factors. If the different rating factors are truly independent, then any correlation among the rating factors beyond that expected by chance alone could be interpreted as a method or halo effect. For example, Thorndike (1920) suggests that there should be little or no true correlation between a teacher's intelligence and the quality of his or her voice and that the obtained correlation of .63 between ratings of these attributes clearly suggests a halo effect. However, there is no such logical basis for assuming that the instructional evaluation factors are uncorrelated, and the application of the fourth criterion suggests that the underlying dimensions actually are correlated (i.e., a relationship that is independent of method). Consequently, at least part of the correlation among the rating factors apparently represents a true relationship that is not due to a method or halo effect.

In summary, the Campbell-Fiske analysis provides clear support for convergent validity and for two of the criteria of discriminant validity. There was student-instructor agreement on each of the rating factors, the agreement on any one factor was independent of agreement on other factors, and the pattern of correlations among the factors was similar for both student and instructor ratings. Furthermore, the extent of this support was reasonably consistent across each of the three sets of courses considered separately. However, though alternative explanations do exist, there was also an indication of some halo effect in the student ratings. The convergence argues for the validity of student ratings, whereas the divergence demonstrates the importance of considering the multiple rating scales separately.
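The fourth-criterion check described above (correlating the two sets of heterotrait-monomethod coefficients) reduces to a few lines; the function below assumes the same two-method matrix layout as the earlier sketch and is an illustration rather than the analysis actually run.

```python
import numpy as np

def heterotrait_pattern_similarity(R, n_traits):
    """Correlate the below-diagonal coefficients of the instructor monomethod block
    with the corresponding coefficients of the student monomethod block."""
    rows, cols = np.tril_indices(n_traits, k=-1)       # 36 positions when n_traits = 9
    instructor_block = R[:n_traits, :n_traits]
    student_block = R[n_traits:, n_traits:]
    return np.corrcoef(instructor_block[rows, cols],
                       student_block[rows, cols])[0, 1]
```

With the nine evaluation factors this correlates the 36 pairs of coefficients that produced the r = .43 reported above.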

Table 3
Computational Equations for the Analysis of Variance of the Multitrait-Multimethod Matrices

Source                                      SS                        df                       Variance component
Class (C) (convergent validity)             Nnm(r_t)                  N - 1                    (MS_C - MS_CTM)/nm
Class X Trait (T) (discriminant validity)   Nnm(r_v - r_t)            (N - 1)(n - 1)           (MS_CT - MS_CTM)/m
Class X Method (M) (method/halo effect)     Nnm(r_f - r_t)            (N - 1)(m - 1)           (MS_CM - MS_CTM)/n
C X T X M (error)                           Nnm(1 - r_v - r_f + r_t)  (N - 1)(n - 1)(m - 1)    MS_CTM

Note. N = total number of cases (classes); n = number of different traits; m = number of different methods; r_t = average correlation coefficient in the entire MTMM matrix (including coefficients both above and below the diagonal and values of 1.0 in the diagonals); r_v = average correlation between sources within traits, computed as [2(sum of validity diagonals) + nm]/(nm^2); r_f = average between-traits correlation in the monomethod blocks, computed as [2(sum of monomethod-heterotrait triangle coefficients) + nm]/(mn^2). F ratios for each of the three effects (convergent validity, divergent validity, and method/halo effect) are obtained by dividing the mean square for the effect (the sum of squares divided by the degrees of freedom) by the mean square of the error term.

Convergent and Divergent Validity: Analysis of Variance

Despite the intuitive appeal of the Campbell-Fiske criteria, there are numerous problems in their actual application (Alwin, 1974). Although many were anticipated by Campbell and Fiske (1959), solutions were not offered. One obvious problem is the lack of specification as to what constitutes satisfactory results. The application of their first two criteria of discriminant validity requires that each of the nine convergent validities be compared with 32 different correlations for a total of 288 comparisons. This was performed for the combined data and then three more times for data based on each set of classes. Besides being unwieldy, the likelihood of obtaining rejections due to sampling fluctuations alone increases geometrically with the number of factors and methods. The user is then left the task of summarizing all these comparisons in a way that represents the degree of discriminant validity.

An alternative to the Campbell-Fiske analysis is based on an analysis of variance (ANOVA) model. This model offers a summary statistic and a test of statistical significance for the effects of convergent validity, divergent validity, and method/halo bias. The model was originally developed by Stanley (1961), but it appears to have been popularized by the demonstration (Kavanagh, MacKinney, & Wolins, 1971) that the computations could be performed directly on the multitrait-multimethod matrix of correlations. Stanley (1961) illustrated that when repeated measurements of cases (ratings of college classes in the present example) are measured over all levels of two other variables, traits (rating factors) and methods (rating groups), three orthogonal sources of covariation can be estimated. The main effect due to classes is a test of how well ratings in general discriminate between classes and is taken as a measure of convergent validity (this is not the same sense of convergent validity inferred by Campbell & Fiske). The interaction between classes and traits tests whether the differentiation between classes depends on traits. If it does not, then the traits have no differential or discriminant validity (i.e., the classes are ranked the same regardless of the rating factor). This is taken to be a measure of discriminant validity. The interaction between classes and methods tests whether the differentiation between classes depends on methods. If it does, then the different methods introduce a source of systematic (undesirable) variance. This is taken to be a measure of method or halo effect. The Class X Trait X Method interaction is assumed to measure only random error.
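A minimal sketch of the Table 3 computations for this design follows; the function signature and the method-by-method matrix layout are assumptions consistent with the earlier sketches, not code from the study.

```python
import numpy as np

def mtmm_anova(R, N, n, m):
    """ANOVA of a multitrait-multimethod correlation matrix (after Stanley, 1961,
    and Kavanagh, MacKinney, & Wolins, 1971), following the equations in Table 3.

    R : (n*m x n*m) correlation matrix, traits nested within methods.
    N : number of cases (classes); n : number of traits; m : number of methods.
    """
    # Average correlation in the entire matrix, including the diagonal 1.0s.
    r_t = R.mean()

    # Average within-trait (heteromethod) correlation, diagonal 1.0s included.
    validity_sum = sum(np.trace(R[a*n:(a+1)*n, b*n:(b+1)*n])
                       for a in range(m) for b in range(a + 1, m))
    r_v = (2 * validity_sum + n * m) / (n * m**2)

    # Average between-trait correlation within methods, diagonal 1.0s included.
    tri_sum = sum(R[a*n:(a+1)*n, a*n:(a+1)*n][np.tril_indices(n, k=-1)].sum()
                  for a in range(m))
    r_f = (2 * tri_sum + n * m) / (m * n**2)

    ss = {"class": N*n*m*r_t,
          "class_x_trait": N*n*m*(r_v - r_t),
          "class_x_method": N*n*m*(r_f - r_t),
          "error": N*n*m*(1 - r_v - r_f + r_t)}
    df = {"class": N - 1,
          "class_x_trait": (N - 1)*(n - 1),
          "class_x_method": (N - 1)*(m - 1),
          "error": (N - 1)*(n - 1)*(m - 1)}
    ms = {k: ss[k] / df[k] for k in ss}
    components = {"convergent_validity": (ms["class"] - ms["error"]) / (n*m),
                  "discriminant_validity": (ms["class_x_trait"] - ms["error"]) / m,
                  "method_halo": (ms["class_x_method"] - ms["error"]) / n,
                  "error": ms["error"]}
    f_ratios = {k: ms[k] / ms["error"]
                for k in ("class", "class_x_trait", "class_x_method")}
    return components, f_ratios
```

For the present data the call would be mtmm_anova(R, N=329, n=9, m=2), with R the 18 x 18 matrix underlying Table 2.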


The principal advantages of the ANOVA model are its ease of application and the convenient descriptive statistics. However, the approach also has shortcomings, and there is not a clear equivalence between the effects generated by the ANOVA model and the Campbell-Fiske criteria (Marsh & Hocevar, 1980; Schmitt, Coyle, & Saari, 1977).

A summary of the computational procedures and each of the ANOVA effects (convergent validity, divergent validity, and method/halo bias) are presented in Tables 3 and 4. Across all of the courses, each of the effects is statistically significant. However, the magnitude of the discriminant validity effect (the variance component) is nearly twice the size of both the convergent validity effect and the method/halo effect. This same analysis was applied to each of the three sets of courses separately, and these results (see Table 4) also demonstrate a similar pattern of findings. There was some evidence for slightly higher levels of method/halo effect than when the comparison was made across all courses, but the discriminant validity effect was substantially higher in all cases.

As with the Campbell-Fiske analysis, the systematically higher reliability of the student ratings produced biased estimates of the ANOVA effects (Boruch, Larkin, Wolins, & MacKinney, 1970; Marsh & Hocevar, 1980; Schmitt, Coyle, & Saari, 1977). Following the suggestion of Heberlein (Note 1; also see Althauser & Heberlein, 1970), each correlation in the multitrait-multimethod matrix (Table 2) was corrected for attenuation, and the ANOVA was applied to the corrected matrix. As a consequence of the smaller error term resulting from the correction for attenuation, each of the three effects increased in size (see Table 4). However, the increase in the divergent validity effect was larger than the increase in any of the other effects.

In summary, interpretations based upon the ANOVA are quite similar to those resulting from the Campbell-Fiske analysis. There was good support for both the convergent and divergent validity of the instructional evaluations, but there was also an indication of some method/halo effect. However, the size of the divergent validity effect was about twice that of the method/halo effect, and even larger when the results were corrected for attenuation. A similar pattern of findings was observed when each of the three sets of courses was considered separately.
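Correcting every coefficient in the matrix for attenuation, as done here following Heberlein's suggestion, amounts to dividing each off-diagonal entry by the square root of the product of the two variables' reliabilities. The short helper below is an illustrative sketch with placeholder names, not the routine used in the study.

```python
import numpy as np

def disattenuate(R, reliabilities):
    """Correct each off-diagonal correlation for attenuation; keep the diagonal at 1.0.

    reliabilities : one reliability (coefficient alpha) per row/column of R,
    e.g., the 18 alphas in the diagonals of the two triangular matrices of Table 2.
    """
    denom = np.sqrt(np.outer(reliabilities, reliabilities))
    corrected = R / denom
    np.fill_diagonal(corrected, 1.0)
    return corrected
```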

[Table 4, which reports the ANOVA effects (convergent validity, discriminant validity, and method/halo effect) for all courses and for each of the three sets of courses, with and without correction for attenuation, is not legible in this copy.]


Student-Instructor Agreement: Absolute and Relative

Results of the multitrait-multimethod analysis indicate that factors of student ratings are consistently correlated with the corresponding instructor self-evaluation factors. These findings, however, only demonstrate relative agreement and not absolute agreement. For example, if all instructors consistently rated themselves exactly one category higher than did their students on each rating item, there would be perfect relative agreement (a correlation of 1.0) even though there would not be absolute agreement. The purpose of the analysis in this section is to test the relative and absolute agreement between instructors and students on individual evaluation items.

Students evaluate teaching effectiveness to be slightly better than do their instructors. Excluding the Workload/Difficulty items, which do not vary along the same scale as the other items (1 = "Very poor" to 5 = "Very good"), the median rating for the remaining 31 items is 3.9 for instructors and 4.0 for students (see Table 5). However, across these items there are no differences between students and instructors in the evaluation of undergraduate courses taught by faculty (median ratings of 4.0) or courses taught by teaching assistants (median ratings of 3.8). It is only in graduate level courses that student ratings are slightly higher than faculty self-evaluations (median ratings of 4.2 vs. 4.1). Both students and instructors judge teaching effectiveness to be slightly better in graduate level courses and slightly poorer in courses taught by teaching assistants.
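The distinction between relative and absolute agreement drawn above can be seen in a two-line example: if every instructor rated exactly one category higher than the class mean, the correlation would be a perfect 1.0 even though every pair of ratings disagreed by a full category. The numbers below are invented.

```python
import numpy as np

student_means = np.array([3.2, 3.8, 4.1, 4.5, 3.9])   # invented class-mean ratings
instructor_self = student_means + 1.0                  # each instructor one category higher

relative = np.corrcoef(student_means, instructor_self)[0, 1]   # 1.0: perfect relative agreement
absolute = (instructor_self - student_means).mean()            # 1.0: constant absolute disagreement
print(relative, absolute)
```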

Although student and instructor responses are similar when averaged across all items, there are differences for specific items. Students rated instructors more favorably than the instructors rated themselves on 10 items, whereas teachers evaluated themselves more favorably on 5 items. Students generally gave more favorable responses to items within the Learning/Value, Enthusiasm, and Breadth of Coverage factors, and instructors gave themselves more favorable responses to items in the Individual Rapport and Examinations/Grading factors. Instructors also estimated that students spent fewer hours outside of class than was actually reported by students.

The mean instructor self-evaluation on each of the 31 evaluation items (again excluding the Workload/Difficulty items), averaged across all instructor responses, correlates quite highly (r = .75) with the mean student ratings. This implies that students and teachers agree on what the teachers as a whole did best and worst. For example, both groups rate teachers as most effective at being enthusiastic about teaching, at being friendly toward students, and at welcoming students to seek help or advice. On the other hand, both groups view teachers as being less effective at having a teaching style that holds student interest, at providing valuable feedback on examinations, and at presenting lectures in a way that facilitates taking good notes.

Correlations between instructor self-ratings and ratings by their students are also presented in Table 5. The meaning of these correlations is similar to that of the other convergent validity coefficients that have already been discussed, except that they are based on individual items rather than factor scores. Across all courses, 34 of the 35 correlations reached statistical significance, the median correlation being .30. It is interesting to note that these correlations are lower than those obtained with factor scores, even when the factor scores were not corrected for unreliability (median uncorrected r = .37). The higher correspondence between the evaluation factors was primarily due to the greater reliability and generality of the factors compared to the individual items.

In summary, these findings demonstrated good agreement, both absolute and relative, between student evaluations and the corresponding evaluations by their instructors. Differences between mean student and instructor ratings were small, correlations were significant for 34 of 35 items, and the students and teachers agreed on the teaching behaviors that were performed more and less effectively by the teachers.


Table 5
Absolute and Relative Agreement Between Students' Evaluations and Instructor Self-Evaluations (N = 329 Classes)

For each of the 35 evaluation items (the same paraphrased items listed in Table 1), the table reports the instructor self-evaluation mean and standard deviation, the student evaluation mean and standard deviation, the mean difference (absolute agreement), and the student-instructor correlation (relative agreement).

[The individual item values are not legible in this copy.]

Note. Two-tailed statistical tests were used in determining absolute agreement (mean differences that appear in the column labeled "Difference"), since it was assumed that student evaluations might be higher or lower than instructor self-evaluations. One-tailed tests were used to test relative agreement (the correlations) because it was assumed that all correlations would be positive. Absolute agreement could not be tested for factor scores (weighted averages of z scores), since they have a mean of 0 (or some other arbitrary value).
*p < .05. **p < .01.


Discussion

Instructors evaluated the effectiveness of their own teaching and were evaluated by their students on the same 35-item evaluation form in a total of 329 different courses. In spite of faculty scepticism concerning the validity of student ratings and their belief that many sources of potential bias do substantially affect the ratings, there was good student-instructor agreement. Separate factor analyses of student and instructor ratings both resulted in the same set of nine evaluation factors that had been previously identified. This suggests that similar dimensions underlie both student and instructor evaluations. Correlations between student and instructor ratings on the same factors were generally high (median r = .45) and always statistically significant, whereas correlations between their ratings on different factors tended to be low (median r = .02) and generally did not reach statistical significance. This argues for the validity of the ratings in general, for the distinctiveness of the different factors, and thus for the importance of considering multiple rating scales separately. Although the validity coefficients were slightly lower for graduate level courses (median r = .39, as opposed to .41 and .46 for undergraduate courses taught by faculty and teaching assistants, respectively), the general conclusions based on the entire set of courses were also true for each of the three sets of classes considered separately. This offers evidence for the validity of student ratings at different levels of university teaching.

Several alternative approaches were used to explore both the convergent and divergent validity of the teacher evaluations. Convergent validity, that which is typically determined, refers to the relationship between student and instructor ratings on the same evaluation factor. The results of the study offered clear support for the convergent validity of teacher evaluations. Divergent or discriminant validity was assessed by seeking the answers to two related questions. First, is the student-instructor agreement on an evaluation factor specific to that factor, or can it be explained in terms of a generalized agreement common

to all the different factors? Second, are the correlations between the different factors as evaluated by faculty and students indicative of a halo effect, or do they represent true relationships among the underlying dimensions?

The answer to the first question is quite clear; student-instructor agreement on each evaluation factor is specific and distinct from agreement on other factors. Whereas correlations between student and instructor ratings on the same factor are uniformly high, correlations between their ratings on different factors are generally low. The question of a halo effect was somewhat more complicated. The ANOVA indicates a significant method/halo effect, and the Campbell-Fiske analysis suggests that the halo effect occurs primarily with student ratings. However, the Campbell-Fiske analysis also indicates that the pattern of correlations among factors was similar for student and instructor ratings. This suggests that the factors are truly correlated. Furthermore, the lower reliability of the instructor self-evaluations could also explain why the correlations among these factors are lower than those observed for the student ratings. Nevertheless, the results do suggest that there is at least some halo effect in the student ratings. There is little evidence of a halo effect in the instructor self-evaluations.

The four previous studies most comparable to this investigation reported convergent validities of .31 and .65 (Braskamp et al., 1979), .47 (Doyle & Crichton, 1978), .62 (Webb & Nolan, 1955), and .49 (Marsh et al., 1979). Three of these, all but the Webb and Nolan study, also considered divergent validity. Doyle and Crichton found little support for the discriminant validity of the ratings, but their study was based on only 10 cases and they considered individual items rather than factor scores. Braskamp et al. reported only limited support for discriminant validity, but their study included only 17 cases and they did not use factor-analytically derived rating scales. In the Marsh et al. study there was strong support for divergent validity, although there was also some indication of a halo effect in the student ratings. That study was based on ratings from 83 courses and employed factor scores that were derived from factor analysis. The results of the present study provide a strong replication of that previous finding.

In summary, the present study has three important implications for the study and use of students' evaluations of teaching effectiveness. First, student ratings show good agreement with instructor self-evaluations of teaching effectiveness. This is a validity criterion that is acceptable to most student evaluation users, that can be applied in all instructional settings, and that may be helpful in overcoming faculty reservations about the usefulness of student ratings. Second, there was consistent evidence for the validity of student ratings for both undergraduate and graduate level courses. Third, the distinctiveness of the different rating scales establishes the importance of using multifactor evaluation instruments that are developed with the use of factor analysis. Much of the ambiguity in research on student ratings may stem from the unfortunate practice of assuming student ratings to be unidimensional.

Reference Note

1. Heberlein, T. A. The correction for attenuation and the multitrait-multimethod matrix. Unpublished master's thesis, University of Wisconsin, 1969.

References

Althauser, R. P., & Heberlein, T. A. Validity and the multitrait-multimethod matrix. In E. P. Borgatta & G. W. Bohrnstedt (Eds.), Sociological methodology 1970. San Francisco: Jossey-Bass, 1970.
Alwin, D. F. Approaches to the interpretation of relationships in the multitrait-multimethod matrix. In H. L. Costner (Ed.), Sociological methodology 1973-1974. San Francisco: Jossey-Bass, 1974.
Blackburn, R. T., & Clark, M. J. An assessment of faculty performance: Some correlates between administrators, colleagues, students, and self-ratings. Sociology of Education, 1975, 48, 242-256.
Boruch, R. F., Larkin, J. D., Wolins, L., & MacKinney, A. C. Alternative methods of analysis: Multitrait-multimethod data. Educational and Psychological Measurement, 1970, 30, 833-853.
Braskamp, L. A., Caulley, D., & Costin, F. Student ratings and instructor self-ratings and their relationship to student achievement. American Educational Research Journal, 1979, 16, 295-306.
Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Centra, J. A. Self-ratings of college teachers: A comparison with student ratings. Journal of Educational Measurement, 1973, 10, 287-295.
Centra, J. A. Student ratings of instruction and their relationship to student learning. American Educational Research Journal, 1977, 14, 17-24.
Doyle, K. O., & Crichton, L. I. Student, peer, and self-evaluations of college instructors. Journal of Educational Psychology, 1978, 70, 815-826.
Frey, P. W. Student ratings of teaching: Validity of several rating factors. Science, 1973, 182, 83-85.
Frey, P. W. A two-dimensional analysis of student ratings of instruction. Research in Higher Education, 1978, 9, 69-91.
Frey, P. W., Leonard, D. W., & Beatty, N. W. Student ratings of instruction: Validation research. American Educational Research Journal, 1975, 12, 327-336.
Hildebrand, M., Wilson, R. C., & Dienst, E. R. Evaluating university teaching. Berkeley: Center for Research and Development in Higher Education, University of California, 1971.
Kavanagh, M. J., MacKinney, A. C., & Wolins, L. Issues in managerial performance: Multitrait-multimethod analysis of ratings. Psychological Bulletin, 1971, 75, 34-49.
Levine, M. S. Canonical analysis and factor comparison (Sage University Paper Series on Quantitative Applications in the Social Sciences, Series No. 07-001). Beverly Hills, Calif.: Sage Publications, 1977.
Marsh, H. W. The validity of students' evaluations: Classroom evaluations of instructors independently nominated as best and worst teachers by graduating seniors. American Educational Research Journal, 1977, 14, 441-447.
Marsh, H. W., Fleiner, R., & Thomas, C. S. Validity and usefulness of student evaluations of instructional quality. Journal of Educational Psychology, 1975, 67, 833-839.
Marsh, H. W., & Hocevar, D. An application of LISREL modeling to multitrait-multimethod analysis. Proceedings of the Australian Association for Research in Education 1980 Annual Conference, 1980, 2, 282-295.
Marsh, H. W., & Overall, J. U. Validity of students' evaluations of teaching: A comparison with instructor self-evaluations by teaching assistants, undergraduate faculty and graduate faculty. Paper presented at the Annual Meeting of the American Educational Research Association, April 1979. (ERIC Document Reproduction Service No. ED 177 205)
Marsh, H. W., & Overall, J. U. Validity of students' evaluations of teaching effectiveness: Cognitive and affective criteria. Journal of Educational Psychology, 1980, 72, 468-475.
Marsh, H. W., Overall, J. U., & Kesler, S. P. Validity of student evaluations of instructional effectiveness: A comparison of faculty self-evaluations and evaluations by their students. Journal of Educational Psychology, 1979, 71, 149-160.
Marsh, H. W., Overall, J. U., & Thomas, C. S. The relationship between students' evaluation of instruction and expected grade. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, April 1976. (ERIC Document Reproduction Service No. ED 126 140)
Nie, N. H., Hull, C. H., Jenkins, J. G., Steinbrenner, K., & Bent, D. H. Statistical Package for the Social Sciences. New York: McGraw-Hill, 1975.
Overall, J. U., & Marsh, H. W. Students' evaluations of instruction: A longitudinal study of their stability. Journal of Educational Psychology, 1980, 72, 321-325.
Schmitt, N., Coyle, B. W., & Saari, B. B. A review and critique of analysis of multitrait-multimethod matrices. Multivariate Behavioral Research, 1977, 12, 447-478.
Stanley, J. C. Analysis of unreplicated three-way classifications with application to rater bias and trait independence. Psychometrika, 1961, 26, 205-219.
Thorndike, E. L. A constant error in psychological ratings. Journal of Applied Psychology, 1920, 4, 25-29.
Webb, B. W., & Nolan, C. Y. Student, supervisor, and self-ratings of instructional proficiency. Journal of Educational Psychology, 1955, 46, 42-46.

Received December 22, 1980

