Evaluating, in translation, is a prototypical concept with many extensions. Readers tend to view it as a matter of acceptability, adequacy, or quality, whereas other stakeholders conceive of it as an activity, or part of an activity, of proofreading, grading, correcting, revising, editing, assessing, and so on. Means and goals often differ as well. These circumstances, together with the enormously varied personal criteria and standards of evaluators, support the generally accepted view that evaluation cannot be studied. However, we are trying to find out whether studying the way subjects actually evaluate might shed light on some regularities which could help us better understand what is at stake. In other words, we are trying to find out whether there is some order in subjectivity.
Evaluating several translations of the same original is rather unnatural in the market. It comes up almost exclusively in translator training, translator hiring, and translation criticism, and the third case is rather different from the other two. However, translator training and hiring are crucial activities for the industry. Can the repeated activity teach us something about evaluating translations? Does the repetition have an influence on the outcome of the evaluation? Those were the questions raised in this piece of research, which is part of a larger effort by Tomás Conde, under the supervision of Ricardo Muñoz, within the activities of the research group Expertise and Environment in Translation (PETRA).
The overarching purpose of this research project is to map intra- and intergroup coincidences and differences in evaluating² translations. This preliminary field study can be described as a piece of descriptive-relational research. It is descriptive, since it seeks to depict what already exists in a group or population, and it is relational, since it investigates the connection between variables that are already present in the group or population.
At this stage, and after a pilot study, we have already processed data from 10 students. Such a small number of subjects limits the weight of the results, but we think the findings are interesting enough to circulate. In the near future, the research project will compare variables across larger numbers of subjects and also between different population groups: apart from translation students, we will study professional translators, translation teachers, and addressees.
1. Materials and methods
Thirty-five students in their fourth year of the translation degree at the University of Granada were invited to assess / correct / proofread / edit / revise four sets of 12 translations each, corresponding to four originals, according to their beliefs and intuition, and to the best of their knowledge. This report analyzes the data from ten of these subjects, the first whose processing has been completed. As for the texts, two of the originals (A and C) dealt with politics, and the other two (B and D), with technical procedures for painting machinery. Translations had been carried out by students from an earlier course, and they were chosen from among those which had not been assigned a very good grade by the teacher, so that no single translation would serve as a model for the evaluators. The sets were sequenced alternately (A, politics; B, technical painting; C, politics; and D, technical painting) to prompt evaluators to think of them as separate tasks. Translations within each set were randomly ordered and coded so that subjects would work on them blind. Originals and translations were provided as digital files, but printouts were provided upon request, and subjects were also allowed to print them out.
1 Corresponding author: Tomás Conde Ruano, Dpto. Traducción e Interpretación, Universidad de Granada, Granada E-18071, Spain. conderuano@gmail.com
2 We will use evaluate to cover the set of activities carried out by all subjects, independently of the way they envisioned their task.
Only three constraints were imposed on all subjects alike. They had to (1) process the translations in the order they were given; (2) work on a whole set of translations in a single session; and, finally, (3) classify translations into one of four categories: very good, good, bad, and very bad. In order to allow for computing averages, quality judgments were assigned numerical values: very bad, 1; bad, 2; good, 3; very good, 4.
We wanted to move away from popular approaches to evaluating translations which focus on mistakes and assign arbitrary values to poorly defined categories. To do so, some concepts had to be operationalized. Since evaluators do not only mark mistakes, we defined phenomenon as what motivates an evaluator to act upon a particular text segment. Phenomena were classified into two groups: normalized phenomena and not-normalized phenomena.
Normalized phenomena include typos, punctuation and spelling mistakes, formatting variations, concordance, syntax, weights and measures, and the like, where there is a proper option sanctioned by an authority (normally, the RAE). Not-normalized phenomena include instances such as word order, paraphrase, register, and different interpretations of the original. The classification is neither homogeneous nor totally sharp, but it responds to the nature of the phenomena quite well, does not demand a strong heuristic effort, and brings about a considerable reduction of undetermined phenomena (5.99%).
On the other hand, phenomena were taken to be more or less salient according to the number of subjects who singled them out. Hence, a text segment where seven evaluators perform an action is thought of as more salient than a text segment where only three evaluators seem to notice a phenomenon. For reasons of space, saliency is restricted here to phenomena on which more than half of the evaluators coincide.
An (evaluative) action is any mark introduced by the evaluator on the text. In this study, actions have been limited to those which remain once the evaluator is done and turns in the translation. Hence, inconsistencies and online changes, which might be very informative, have not been taken into account here, and will be the subject of future research.
Professionals and some teachers often quote the amount of work needed to fix a translation. Since actions may entail varying quantities of work depending on their nature, systematicity, and other factors which may be specific to each phenomenon, they were operationalized as quantity of actions and types of actions. The classification of actions into types was made according to a behavioral criterion, as observed in the evaluated translations. Actions observed so far can be divided into those made in the body of the text and those made at the margin (which also includes the space before and after the body of the text).
In both cases, actions could also be classified into additions, suppressions, changes, marks, annotations, and comments. Also, some evaluators chose to code their marks so as to classify phenomena in some way.
Evaluators do not seem to have clear or conscious criteria to evaluate translations, and many of those who state a set of parameters turn out to apply them rather unevenly in actual practice. Hence, to try to account for their standards, we defined demand as the sum of conscious and unconscious expectations an evaluator seems to think that a translation should meet. Demand was operationalized from two perspectives: (1) level, that is, whether evaluators seem to expect more or less from a translation, as reflected in their quality judgments; and (2) evenness, or the uniformity or lack of variation in the level of demand. The second perspective may indicate the existence of clear and/or stable criteria for evaluating, or else an attempt to pursue some even-handedness.
Order effects were defined as any consistent tendency across evaluators which cannot be explained as a feature of the translations when considered separately. We searched for three types of order effects: (1) within the whole task; (2) within each set; and (3) within texts. Since translations were evaluated in the same order, task effects were analyzed simply by observing changes in the progression of the task. To search for set effects, translations were grouped into three subsets, in such a way that subset I includes translations 1-4 from each set, subset II includes translations 5-8, and subset III includes translations 9-12. For example, subset I includes the actions carried out in translations A01-04, B01-04, C01-04, and D01-04. For effects within translations, originals were divided into three sections (initial, middle, and final) of roughly identical length, and translations were divided accordingly. Data were entered in an Access database and later analyzed with SPSS 12.0.
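Two of the operationalizations described in this section, saliency and subset membership, can be illustrated with a minimal Python sketch. It is not the procedure actually run in Access and SPSS: the record layout (pairs of evaluator and segment), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

N_EVALUATORS = 10  # subjects processed so far, as stated in the text


def subset_of(code: str) -> str:
    """Map a translation code such as 'B07' to subset I, II, or III."""
    position = int(code[1:])  # position of the translation within its set (1-12)
    if not 1 <= position <= 12:
        raise ValueError(f"unexpected translation code: {code}")
    return ("I", "II", "III")[(position - 1) // 4]


def salient_segments(actions, n_evaluators=N_EVALUATORS):
    """Return the segments acted upon by more than half of the evaluators.

    `actions` is an iterable of (evaluator, segment) pairs. This layout is
    hypothetical; it is not the schema of the authors' database.
    """
    evaluators_per_segment = defaultdict(set)
    for evaluator, segment in actions:
        evaluators_per_segment[segment].add(evaluator)
    return {
        segment
        for segment, evaluators in evaluators_per_segment.items()
        if len(evaluators) > n_evaluators / 2
    }


# Invented toy data: segment "s1" is acted upon by six evaluators, "s2" by three.
toy_actions = [(f"I{n:02d}", "s1") for n in range(2, 8)]
toy_actions += [(f"I{n:02d}", "s2") for n in range(2, 5)]

print(subset_of("A01"), subset_of("B07"), subset_of("D12"))
print(salient_segments(toy_actions))
```

Counting distinct evaluators per segment, rather than raw actions, keeps a single evaluator who acts twice on the same segment from inflating its saliency.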
2. General results and discussion
To frame our findings on order effects in serial evaluation, we first need to characterize subjects and their behaviors. Analytical parameters emerged from the detailed study of results and their comparison. The first parameter was final quality judgments, which were assigned numerical values to allow for computing averages: very bad, 1; bad, 2; good, 3; very good, 4.
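The numerical coding of quality judgments can be sketched as follows. The scale values come from the text; the function name and the ten judgments are invented for illustration and do not reproduce any row of Table 1.

```python
from statistics import mean

# Scale as defined in the text.
SCALE = {"very bad": 1, "bad": 2, "good": 3, "very good": 4}


def average_judgment(labels):
    """Average the numerical values of a list of quality labels."""
    return mean(SCALE[label.lower()] for label in labels)


# Invented judgments by ten evaluators on one hypothetical translation.
toy_labels = ["bad", "bad", "good", "very bad", "bad",
              "good", "bad", "very bad", "good", "good"]

print(average_judgment(toy_labels))
```

Averaging treats the four ordinal categories as equally spaced, which is an assumption built into the coding itself.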
Table 1 displays average quality judgments for the 48 translations. In set A, translations A02 and A10 received the best grades, whereas A03 and A06 got the lowest. The median value of all translations was 2.1. Technical translations received higher grades than general translations. The number of words in a translation does not correlate with the corresponding quality judgments, although evaluators I05 and I03 tended to think that long translations are good (correlations of 0.295 and 0.291, respectively, significant at 0.05). Graphic 1 shows the frequency of average quality judgments in the task, which is close to a normal distribution (Gaussian bell curve), except that the curve is shifted to the left, probably because translations were purposefully chosen among the worst ones, so that evaluators would have something to act upon. Only nine translations were deemed Good or Very good (green columns). When the continuum is divided into three equal intervals, only two translations reach the highest third (blue background).
Graphic 1. Frequency of quality judgment averages in the task.
2.1. Demand
2.1.1. Level
Table 2 provides some information to capture some specifics of subjects' behavior. Correlations between quality judgments by evaluators were statistically significant between I02 and I06 (0.426), I04 and I10 (0.627), I05 and I07 (0.375), and I07 with I08 (0.384) and with I09 (0.596).
Evaluator I03 has the best opinion of the translations (set A average, 2.83; set C average, 2.92; general average, 2.74). Other generous or lenient evaluators are I04 (2.62 average), I11 (2.45 average) and I07 (2.27 average). On the other hand, I06 is the harshest or most demanding evaluator (set C average, 1.17; general average, 1.52), followed by I05 (1.69 average). Graphic 2 shows that subjects can be classified into three groups: I05 and I06 are the most demanding evaluators; I03, I04, I10, and I11 are the lenient ones; and I02, I07, I08 and I09 are in between. The intermediate group is remarkably homogeneous.
Graphic 2. Quality judgment, per subject.
2.1.2. Evenness
Graphic 3 displays each evaluator's average quality judgments per set. I05 seems especially tough in set A (1.08 average) when compared to the average, and I02 is very generous in set D (2.92). On the other hand, I08 is very regular throughout the sets (2.25 average), followed by I11, I04 and I03, who have better general opinions of the translations. Clearly, lenient evaluators (plus medium evaluator I08) seem more even than the rest in all texts.
Graphic 3. Set averages of subjects' quality judgments.
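The two perspectives on demand can be sketched numerically. The text does not state how evenness was computed, so the sketch below assumes, purely for illustration, that level is the mean of an evaluator's four set averages and that evenness is read from their population standard deviation (a smaller spread meaning a more even evaluator). The evaluator names and figures are invented.

```python
from statistics import mean, pstdev

# Invented set averages (sets A to D) for three hypothetical evaluators.
set_averages = {
    "E1": [2.8, 2.7, 2.9, 2.6],
    "E2": [1.1, 1.9, 1.4, 2.2],
    "E3": [2.2, 2.3, 2.2, 2.3],
}


def demand_profile(averages):
    """Summarize demand as level (mean) and spread (standard deviation).

    A lower level means a more demanding evaluator, since lower values
    encode worse quality judgments; a lower spread means a more even one.
    """
    return {"level": mean(averages), "spread": pstdev(averages)}


profiles = {name: demand_profile(values) for name, values in set_averages.items()}
most_demanding = min(profiles, key=lambda name: profiles[name]["level"])
most_even = min(profiles, key=lambda name: profiles[name]["spread"])

print(most_demanding, most_even)
```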
2.2. Actions
2.2.1. Quantity
The number of actions correlates significantly with quality judgments (-0.535) when considered text by text, but not when analyzed by subjects. The total number of actions is 11909 (table 3). Within sets, C and D show the largest variations, where one translation may receive up to four times as many actions as another.
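For readers who want to see what a text-by-text correlation of this kind involves, here is a minimal sketch of the Pearson coefficient in plain Python. The study's figures were obtained with SPSS; the function below is only an illustration, and the six data points are invented, so the resulting value has no bearing on the -0.535 reported above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("samples must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented data: actions received by six translations and their average grade.
actions_per_text = [60, 45, 30, 52, 20, 15]
average_grade = [1.5, 2.0, 2.6, 1.8, 2.9, 3.1]

r = pearson_r(actions_per_text, average_grade)
print(round(r, 3))
```

A negative coefficient means that translations receiving more actions tend to receive lower grades.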
Subjects differ widely in the number of actions they carry out (table 4). Subject I02 performed only 627 actions in total, while I10 reached 1877, approximately three times as many.
set / subject      I02     I03     I04     I05     I06     I07     I08     I09     I10     I11   aver.
A (average)      23.08   59      42.5    62.5    37.58   28.42   26.33   47.83   72.67   55.92   45.58
B (average)       8.91   24.92   15.5    30.83   15.83   14      14.67   13.33   30.17   34.5    20.27
C (average)      13.75   19.25   20.67   14.67   19.92   11.17   15.58   12.08   25.17   27.25   17.95
D (average)       6.5    16.33   13.58   16.33   16.5    11.33   16.25   11.83   28.42   17.33   15.44
Total (average)  13      30      23      31      22      16      18      21      39      34
Nr. of actions   627     1434    1107    1492    1078    779     874     1021    1877    1620
Table 4. Quantity of actions, per subject.
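The figures in Table 4 can be cross-checked with a few lines of Python. The sketch below assumes that each set average is taken over the 12 translations of the set (as the design describes), so that multiplying the sum of a subject's four set averages by 12 should recover that subject's total number of actions, up to rounding. Variable names are ours.

```python
SUBJECTS = ["I02", "I03", "I04", "I05", "I06", "I07", "I08", "I09", "I10", "I11"]
TRANSLATIONS_PER_SET = 12

# Average number of actions per translation, by set and subject (Table 4).
set_averages = {
    "A": [23.08, 59, 42.5, 62.5, 37.58, 28.42, 26.33, 47.83, 72.67, 55.92],
    "B": [8.91, 24.92, 15.5, 30.83, 15.83, 14, 14.67, 13.33, 30.17, 34.5],
    "C": [13.75, 19.25, 20.67, 14.67, 19.92, 11.17, 15.58, 12.08, 25.17, 27.25],
    "D": [6.5, 16.33, 13.58, 16.33, 16.5, 11.33, 16.25, 11.83, 28.42, 17.33],
}

# Total number of actions per subject, as reported in Table 4.
reported_totals = dict(zip(
    SUBJECTS, [627, 1434, 1107, 1492, 1078, 779, 874, 1021, 1877, 1620]
))

# Rebuild each subject's total from the four set averages.
reconstructed = {
    subject: round(TRANSLATIONS_PER_SET * sum(row[i] for row in set_averages.values()))
    for i, subject in enumerate(SUBJECTS)
}

grand_total = sum(reported_totals.values())
ratio = reported_totals["I10"] / reported_totals["I02"]

print(reconstructed == reported_totals, grand_total, round(ratio, 2))
```

The reconstructed totals match the reported ones, the grand total equals the 11909 actions mentioned in the text, and the ratio between I10 and I02 is close to three.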
Graphic 4. Set averages for subjects' quantity of actions.
Graphic 4 displays the average quantity of actions that each subject performed for every set, ordered from left to right in decreasing total quantity of actions. Curiously, four of the five subjects who performed the most actions were lenient evaluators, and medium evaluators performed fewer actions than demanding ones.
2.2.2. Types
Marking is the only type of action which correlates significantly at 0.01 with quality judgments (so do changes at the margin, but they are very few). Evaluators I03, I04, I07, I09 and I11 tended to act on the text by adding, suppressing, and changing text. At the opposite pole, subjects I02 and I08 were clearly oriented toward offering feedback to the translator or the researcher, whereas the rest did not seem to have a clear pattern of behavior.
actions at the margin / subjects   I02 I03 I04 I05 I06 I07 I08 I09 I10 I11   total
Classification                     612 821 747                               2180
Mark                               568 636 142 3 2 127 39                    1517
Addition                           1                                         1
Note                               22                                        22
Actions co-occur in certain patterns. Adding in text strongly correlates with changing in text (0.896), suppressing (0.864) and marking (0.721). Other correlations show emergent profiles of coherent behavior: evaluators generally tend either to try to fix the translations for later use (text-oriented), or else to provide explanations of the sense of their action to the translator or the researcher (feedback-oriented). Graphic 5 shows the distribution of the five most common actions across subjects. Subjects I02 and I08 only classify phenomena, whereas I03, I04 and I09 focus on changing, adding, and suppressing in the body of texts. In any case, lenient evaluators seem to focus on introducing changes in the translations, whereas demanding evaluators tend to mark phenomena more often.
Graphic 5. Types of actions, per subject.
When contrasted with their level of demand, demanding evaluators turn out to prefer just marking phenomena, medium evaluators perform more classifications, and lenient evaluators introduce more changes and suppressions. As for comments, no clear pattern emerged from their use. It is worth noting, however, that I05, one of the more demanding evaluators, was the subject who made the most comments (37.5% of all), followed by I10 (14.4%). On the other hand, the subjects who introduced the fewest comments were I04 (1.4%) and I03 (2.8%), the two most lenient subjects.
2.3. Summary of subjects' profiles
Evaluators show consistent tendencies (1) to adopt a higher or lower level of demand, and (2) to confront different texts with a higher or lower degree of evenness. Their actions on the translations may be (3) more or less abundant; (4) text-oriented or feedback-oriented; and (5) supported with few or many comments.
Table 6 displays a summary of criteria, where evaluators have been grouped according to their results. Column I displays the level of demand, from the most lenient (1) to the most demanding (3). Column II displays the level of evenness, from the most even (1) to the most uneven evaluator (2). Column III displays the order of subjects according to the quantity of actions on texts, from the fewest (1) to the most abundant (3). Column IV ranks subjects from the most feedback-oriented (1) to the most text-oriented (4). Finally, column V ranks subjects according to the number of comments they introduced, from the fewest (1) to the most abundant (4).
In brief, demanding evaluators tend to be feedback-oriented and are uneven in their level of demand. Medium evaluators tend to perform few actions and are also rather uneven. I02 and I07 behave similarly. Lenient evaluators seem more homogeneous: they are text-oriented, perform many actions when evaluating, and tend to be fairly even in their judgments. I03 and I04 seem particularly close in the way they evaluate.
Of course, 10 evaluators are far too few to think that the data can hold any consistent truth, but they are interesting from two perspectives. First, they point to possible tendencies and relationships between variables when analyzing the behavior of evaluators; hence, this research strategy seems promising and deserves further research. Second, the variation in evaluators' behaviors is the background against which to contrast the regularities found in all of them. These regularities can be explained as order effects in serial translation evaluation.
3. Order effects
3.1. Order effects in the whole task
Graphic 4 showed that the number of actions decreases dramatically from set A to the rest in all subjects. This is the first and most obvious order effect, and might be due to the students' lack of experience as evaluators. They may have started by performing many actions, only to realize progressively that it meant too much work, or else that it was not necessary to perform so many actions to carry out the task. As for the type of actions, graphic 6 shows that whereas the number of changes, additions and suppressions decreases between set A and set D, classifications are the only type of action that seems to increase. This supports the notion that the decrease in actions might be due to subjects adjusting their effort to the task. In fact, classifications increase because one of the subjects changed her strategy: she stopped changing and started classifying in the middle of the task.
Graphic 6. Type of actions in different sets.
Graphic 7 shows that the share of salient phenomena (>5), that is, phenomena singled out by more than five evaluators, stays around 20% in sets A, C, and D. The original in set B was the first technical text, and students were not familiar with the subject matter. This might explain the drop in coincidences. The relative increase in normalized phenomena within salient phenomena probably indicates that evaluators felt uncomfortable with the text. In any case, it is worth pointing out that normalized phenomena only account for ca. 5% of all actions.
Graphic 7. Percentage of salient phenomena (>5) in each set.
3.2. Order effects within sets
Table 7 shows the number of actions in the three subsets. There is a general tendency to reduce the quantity of actions from one subset to the next, which may be due to an improvement in efficiency. Again, this supports the notion that evaluators were learning how to carry out the task as they were doing it. The exception is set D, where subset III has more actions than subset II, but it also has a lower quality judgment average.
Subsets/Sets      A      B      C      D    Subset total
I              1955    974    782    733    4444
II             1856    818    715    517    3906
III            1659    640    657    603    3559
Table 7. Number of actions per subset.
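The pattern in Table 7 can be checked mechanically. The following sketch simply re-enters the table's figures (variable names are ours) and looks for sets in which the number of actions does not fall from each subset to the next.

```python
# Number of actions per subset and set, as reported in Table 7.
actions = {
    "I": {"A": 1955, "B": 974, "C": 782, "D": 733},
    "II": {"A": 1856, "B": 818, "C": 715, "D": 517},
    "III": {"A": 1659, "B": 640, "C": 657, "D": 603},
}
SETS = "ABCD"

subset_totals = {subset: sum(row.values()) for subset, row in actions.items()}
set_totals = {s: sum(row[s] for row in actions.values()) for s in SETS}

# Sets where actions do not decrease strictly from subset I to II to III.
exceptions = [
    s for s in SETS
    if not actions["I"][s] > actions["II"][s] > actions["III"][s]
]

print(subset_totals)
print(set_totals)
print(exceptions)
```

The row sums agree with the last column of the table, the grand total again equals 11909, and set D is the only set that breaks the decreasing pattern.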
While there is a tendency for most types of action to appear less in subsets II and III across sets, suppressions increase in sets A, B, and D; additions and changes, in set D; and classifications in sets C and D. The increase of suppressions throughout three sets may be taken to indicate that evaluators have a clearer notion of the relevance of the information. The rise of classifications throughout sets C and D may be thought of as a consequence of the evaluators getting tired of repeating the same action.
Graphic 8. Percentage of salient (>5) phenomena, in each subset.
Graphic 8 shows a steady increase in the percentage of salient phenomena across the three subsets, probably an indicator of the degree of certainty in the evaluators. The drop in normalized phenomena in subset II might be explained as a function of the degree of confidence of the evaluators in the task.
3.3. Order effects within the texts
Table 8 shows the relationship between the number of actions in different translation sections and quality judgments. All of them are statistically significant, but the closer to the end, the stronger the correlation. The tendency toward increasing significance is evident at sentence level, since actions in the first sentences of the translations do not correlate with quality judgments. The relationship between quality judgments and actions in translation text segments which received a special typographic treatment or else stood out due to their position in the text, such as titles, headings, captions and the like, showed a lower significance than regular segments. Hence, visual prominence was ruled out as an explanation for first and last sentence correlations.
Translations             Pearson     Sig. (2-tailed)
Segment    outstanding   -0.324*     0.025
           regular       -0.525**    0.000
Sentence   first         -0.057      0.701
           last          -0.514**    0.000
           rest          -0.522**    0.000
Section    initial       -0.411**    0.004
           central       -0.548**    0.000
           final         -0.597**    0.000
** Correlation significant at 0.01
Table 8. Relationship between quality judgment and actions.
Hence, evaluators seem to identify phenomena and perform actions on them in all sections of the translations, but the further down in the text, the stronger the effect on their judgment of the quality of the translation as a whole. Interestingly, this does not correspond to the percentage of salient phenomena, which drops in central sections.
Graphic 9. Percentage of salient phenomena (>5) in initial, central, and final sections of translations.
Quality judgments are independent of the quantity of actions introduced, when considered by subject. Lenient evaluators do perform more actions than demanding evaluators, and medium evaluators perform the fewest, as shown in graphic 10.
Graphic 10. Quantity of actions in initial, central, and final sections by lenient, medium, and demanding evaluators.
Another interesting effect can be traced when the number of actions in translation sections (graphic 11), that is, their initial, middle, and final parts, is correlated with average quality judgments. Bad and Good translations show a similar pattern of subjects' behavior, where initial sections contain a number of actions which slightly decreases in central sections to minimally rise again in final sections. Very bad translations, however, show a steady increase in the number of actions across sections, and Very good translations present a constant decrease in the number of actions as the text progresses. This might point to an emotional involvement of evaluators in the process.
Graphic 11. Quantity of actions in initial, central, and final sections of translations, per average quality judgment.