
Effects of Serial Translation Evaluation

Ricardo Muñoz Martín & José Tomás Conde Ruano¹

PETRA Research Group
University of Granada

Evaluating, in translation, is a prototypical concept with many extensions. Readers tend
to view it as a matter of acceptability, adequacy, or quality, whereas other stakeholders
conceive of it as an activity, or part of an activity, of proofreading, grading, correcting,
revising, editing, assessing, and so on. Means and goals also differ quite often.
These circumstances, together with enormously varied personal criteria and standards in
evaluators, support the generally accepted view that evaluation cannot be studied.
However, we are trying to find out whether studying the way subjects actually evaluate
might shed light on some regularities which could help us better understand what is at
stake. In other words, we are trying to find out whether there is some order in
subjectivity.
Evaluating several translations from the same original is rather unusual
in the market. It arises almost exclusively in translator training, translator hiring,
and translation criticism, and the third case is rather different from the other two.
However, translator training and hiring are crucial activities for the industry. Can the
repeated activity teach us something about evaluating translations? Does the repetition
have an influence on the outcome of the evaluation? Those were the questions raised in
this piece of research, which is a part of a larger effort by Tomás Conde, under the
supervision of Ricardo Muñoz, within the activities of the Research Group "Expertise
and Environment in Translation" (PETRA).
The overarching purpose of this research project is to map intra- and intergroup
coincidences and differences in evaluating² translations. This preliminary field study
can be described as a piece of descriptive-relational research. It is descriptive, since it
seeks to depict what already exists in a group or population, and it is relational since it
investigates the connection between variables that are already present in the group or
population. At this stage, and after a pilot study, we have already processed data from
10 students. The small number of subjects limits the weight of the results, but
we think the findings are interesting enough to circulate. In the near future, the
research project will compare variables with larger numbers of subjects and also
across different population groups; apart from translation students, we will study
professional translators, translation teachers, and addressees.


1 Corresponding author: Tomás Conde Ruano, Dpto. Traducción e Interpretación. Universidad de
Granada. Granada E-18071 Spain. conderuano@gmail.com
2 We will use "evaluate" to cover the set of activities carried out by all subjects, independently of the way
they envisioned their task.


1. Materials and methods

35 students in their fourth year of the translation degree at the University of Granada
were invited to assess / correct / proofread / edit / revise four sets of 12 translations
each corresponding to four originals, according to their beliefs and intuition, and to the
best of their knowledge. This report analyzes the data of ten of these subjects, the first
to be completed. As for the texts, two of the originals (A and C) dealt with politics, and
the other two (B and D), with technical procedures for painting machinery. Translations
had been carried out by students from an earlier course, and they were chosen from among
those which had not been assigned a very good grade by the teacher, so that no single
translation would serve as a model for the evaluators. The sets were sequenced alternately
(A, politics; B, technical painting; C, politics; and D, technical painting) to
prompt evaluators to think of them as separate tasks. Translations within each set were
randomly ordered and coded for blind intervention by subjects. Originals and
translations were provided as digital files, but printouts were provided upon request, and
subjects were also allowed to print them out.
The only constraints imposed on all subjects alike were the following three: They
had to (1) process the translations in the order they were given; (2) work on a whole set
of translations in a single session and, finally, (3) classify translations into one of four
categories: very good, good, bad, and very bad. In order to allow for computing
averages, quality judgments were assigned numerical values: very bad, 1; bad, 2; good,
3; very good, 4.
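The numerical coding of the four categories can be sketched as follows. This is only an illustration of the averaging step: the function name and the ten example judgments are hypothetical and are not data from the study.

```python
from statistics import mean

# Numerical values assigned to the four quality categories in the study.
QUALITY_VALUES = {"very bad": 1, "bad": 2, "good": 3, "very good": 4}


def average_judgment(labels):
    """Return the mean numerical value of a list of category labels."""
    return mean(QUALITY_VALUES[label] for label in labels)


# Hypothetical judgments by ten evaluators for one translation
# (illustrative only, not taken from the study).
example_labels = [
    "bad", "bad", "very bad", "good", "bad",
    "very bad", "bad", "good", "bad", "bad",
]

print(average_judgment(example_labels))
```

An average computed this way treats the four categories as equally spaced, which is an assumption implicit in any averaging of ordinal labels.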
We wanted to move away from popular approaches to evaluating translations which
focus on mistakes and assign arbitrary values to poorly defined categories. To do so,
some concepts had to be operationalized. Since evaluators do not only mark mistakes,
we defined a phenomenon as whatever motivates an evaluator to act on a particular text
segment. Phenomena were classified into two groups: normalized phenomena and
not-normalized phenomena. Normalized phenomena include typos, punctuation and spelling
mistakes, formatting variations, concordance, syntax, weights & measurements, and the
like, where there is a proper option sanctioned by an authority (normally, the RAE).
Not-normalized phenomena include instances such as word order, paraphrase, register, and
different interpretations of the original. The classification is neither homogeneous nor
totally sharp, but it fits the nature of the phenomena fairly well, does not
demand a strong heuristic effort, and brings about a considerable reduction of
undetermined phenomena (5.99%). On the other hand, phenomena were taken to be
more or less salient according to the number of subjects who singled them out. Hence, a
text segment where seven evaluators perform an action is thought of as more salient
than a text segment where only three evaluators seem to notice a phenomenon. For
reasons of space, saliency is restricted here to phenomena on which more than half of the
evaluators coincide.
An (evaluative) action is any mark introduced by the evaluator on the text. In this
study, actions have been limited to those which remain once the evaluator is done and
turns in the translation. Hence, inconsistencies and online changes, which might be
very informative, have not been taken into account here and will be the subject of future
research. Professionals and some teachers often quote the amount of work needed to
fix a translation. Since actions may entail varying quantities of work depending on their
nature, systematicity, and other factors which may be specific for each phenomenon,
they were operationalized as quantity of actions, and their types. The classification of
actions into types was made according to a behavioral criterion, as observed in the
evaluated translations. Actions observed so far can be divided into those made in the
body of the text and those made at the margin (which also includes before and after the
body of the text). In both cases, actions could also be classified into additions,
suppressions, changes, marks, annotations, and comments. Also, some evaluators chose
to code their marks so as to classify phenomena in some way.
Evaluators do not seem to have clear or conscious criteria to evaluate translations,
and many of those who state a set of parameters turn out to apply them rather unevenly
in actual practice. Hence, to try to account for their standards we defined demand as the
sum of conscious and unconscious expectations an evaluator seems to think that a
translation should meet. Demand was operationalized from two perspectives: 1) level,
that is, whether evaluators seem to expect more or less from a translation as reflected in
their quality judgments; 2) evenness, or the uniformity or lack of variation in the level
of demand. The second perspective may indicate the existence of clear and/or stable
criteria for evaluating, or else an attempt to pursue some even-handedness.
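One possible way to turn the two perspectives into numbers is sketched below. It is an assumption, not the procedure stated in the paper: level is taken as the mean of an evaluator's set averages (a lower mean suggesting a more demanding evaluator), and unevenness as the population standard deviation of those set averages. The figures are the set averages reported for I05 and I08 in Table 2.

```python
from statistics import mean, pstdev

# Set averages (A, B, C, D) for two evaluators, as reported in Table 2.
set_averages = {
    "I05": [1.08, 2.25, 1.67, 1.75],
    "I08": [2.08, 2.42, 2.33, 2.17],
}


def demand_level(averages):
    """Mean quality judgment; lower values suggest a more demanding evaluator."""
    return mean(averages)


def unevenness(averages):
    """Spread of the set averages (assumed measure: population std. deviation)."""
    return pstdev(averages)


for subject, averages in set_averages.items():
    print(subject, round(demand_level(averages), 2), round(unevenness(averages), 3))
```

Under this assumed measure, a smaller spread corresponds to a more even evaluator.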
Order effects were defined as any consistent tendency across evaluators which
cannot be explained as a feature of the translations when considered separately. We
searched for three types of order effects: (1) within the whole task; (2) within each set;
and (3), within texts. Since translations were evaluated in the same order, task effects
were analyzed simply by observing changes in the progression of the task. To search for
set effects, translations were grouped into three subsets, in such a way that subset I
includes translations 1-4 from each set, subset II includes translations 5-8, and subset III
includes translations 9-12. For example, subset I includes the actions carried out in
translations A01-04, B01-04, C01-04, and D01-04. For effects within translations,
originals were divided into three sections (initial, middle, and final) of roughly identical
length, and translations were divided accordingly. Data were entered in an Access
database and later analyzed with SPSS 12.0.
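The grouping of translations into subsets described above can be sketched as follows. The code format (set letter plus a two-digit position, as in A01) follows the text; the function names are hypothetical.

```python
from collections import defaultdict

SUBSETS = ("I", "II", "III")


def subset_of(code):
    """Map a translation code such as 'B07' to its subset (I, II, or III)."""
    position = int(code[1:])  # position 1..12 within the set
    if not 1 <= position <= 12:
        raise ValueError(f"unexpected position in code {code!r}")
    return SUBSETS[(position - 1) // 4]


def group_by_subset(codes):
    """Group translation codes by subset, keeping their original order."""
    groups = defaultdict(list)
    for code in codes:
        groups[subset_of(code)].append(code)
    return dict(groups)


# All 48 codes: sets A-D, positions 01-12.
all_codes = [f"{s}{n:02d}" for s in "ABCD" for n in range(1, 13)]
groups = group_by_subset(all_codes)
print(groups["I"])
```

Each subset therefore pools sixteen translations, four from each set.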


2. General results and discussion

For the purpose of framing our findings on effects at serial evaluation, we will first need
to characterize subjects and their behaviors. Analytical parameters emerged from the
detailed study of results and their comparison. The first parameter was final quality
judgments, which were assigned numerical values, to allow for computing averages:
very bad, 1; bad, 2; good, 3; very good, 4.

set   1     2     3     4     5     6     7     8     9     10    11    12    set ave.
A     2.1   2.2   1.2   1.5   1.6   1.1   2.1   1.6   1.8   2.9   2.1   2.4   1.883
B     2.4   2.6   2.7   1.7   2.5   1.9   2.8   2.4   2.6   2.6   2.2   2.2   2.383
C     2.2   2.2   1.6   1.2   2.0   2.2   1.3   3.0   2.5   2.3   2.3   1.5   2.025
D     3.3   2.2   1.8   1.6   2.6   2.4   2.3   2.0   1.6   2.0   1.9   2.0   2.141
Table 1. Average quality judgments.

Table 1 displays average quality judgments for the 48 translations. In set A, translations
A02 and A10 received the best grades, whereas A03 and A06 got the lowest. The median
value of all translations was 2.1. Technical translations received higher grades than general
translations. The number of words in the translation does not correlate with the corresponding
quality judgments, although evaluators I05 and I03 tended to think that long translations are
good (correlations of 0.295 and 0.291, respectively, significant at 0.05).
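The set averages in Table 1 and the comparison between technical and political texts can be re-derived from the rounded per-translation averages, as in this minimal sketch. Because the inputs are already rounded to one decimal, the results may differ slightly in the last digit from figures computed on the raw judgments.

```python
# Average quality judgments per translation, transcribed from Table 1.
table1 = {
    "A": [2.1, 2.2, 1.2, 1.5, 1.6, 1.1, 2.1, 1.6, 1.8, 2.9, 2.1, 2.4],
    "B": [2.4, 2.6, 2.7, 1.7, 2.5, 1.9, 2.8, 2.4, 2.6, 2.6, 2.2, 2.2],
    "C": [2.2, 2.2, 1.6, 1.2, 2.0, 2.2, 1.3, 3.0, 2.5, 2.3, 2.3, 1.5],
    "D": [3.3, 2.2, 1.8, 1.6, 2.6, 2.4, 2.3, 2.0, 1.6, 2.0, 1.9, 2.0],
}

# Mean of the twelve translation averages in each set.
set_averages = {s: sum(values) / len(values) for s, values in table1.items()}

# Sets A and C are political texts; B and D are technical texts.
politics = (set_averages["A"] + set_averages["C"]) / 2
technical = (set_averages["B"] + set_averages["D"]) / 2

for s, avg in set_averages.items():
    print(s, round(avg, 3))
print("politics:", round(politics, 3), "technical:", round(technical, 3))
```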
Graphic 1 shows the frequency of average quality judgments in the task, which is
close to a fairly typical distribution (a Gaussian bell curve), except for the fact that the curve is
displaced to the left, probably because translations were purposefully chosen among the
worst ones, to substantiate evaluators' activities. Only nine translations were deemed
Good or Very good (green columns). When the continuum is divided into three equal
intervals, only two translations reach the highest third (blue background).


Graphic 1. Frequency of quality judgment averages in the task.

2.1. Demand
2.1.1. Level
Table 2 provides information that captures some specifics of subjects' behavior.
Correlations between quality judgments by evaluators were statistically significant
between I02 and I06 (0.426), I04 and I10 (0.627), I05 and I07 (0.375), and I07 with I08
(0.384) and with I09 (0.596).

subject / set   A               B               C               D               Total
                aver.   s.d.    aver.   s.d.    aver.   s.d.    aver.   s.d.    aver.   s.d.
I02             2.08    0.996   1.82    0.603   2.25    1.055   2.92    0.669   2.28    0.926
I03             2.83    0.835   2.64    0.505   2.92    0.669   2.58    0.669   2.74    0.675
I04             2.42    0.793   2.73    0.786   2.83    0.835   2.50    0.905   2.62    0.822
I05             1.08    0.289   2.25    0.622   1.67    0.651   1.75    1.138   1.69    0.829
I06             1.83    0.835   1.50    0.674   1.17    0.389   1.58    0.996   1.52    0.772
I07             2.25    0.965   2.92    0.793   2.17    0.937   1.75    0.452   2.27    0.893
I08             2.08    0.900   2.42    0.900   2.33    1.155   2.17    1.030   2.25    0.978
I09             2.00    0.853   2.92    0.793   2.42    0.793   1.58    0.515   2.23    0.881
I10*            -       -       -       -       -       -       -       -       2.50    0.659
I11             2.25    0.965   2.58    1.084   2.50    0.905   2.45    0.934   2.45    0.951
* Only two set averages are reported for I10 (2.67, s.d. 0.492; 2.33, s.d. 0.778); the sets they belong to are not identified.
Table 2. Quality judgment, per subject and set.

Evaluator I03 has the best opinion of the translations (set A average, 2.83; set C
average, 2.92; general average, 2.74). Other generous or lenient evaluators are I04 (2.62
average), I11 (2.45 average) and I07 (2.27 average). On the other hand, I06 is the
harshest or most demanding evaluator (set C average, 1.17; general average, 1.52),
followed by I05 (1.69 average). Graphic 2 shows that subjects can be classified into
three groups: I05 and I06 are the most demanding evaluators; I03, I04, I10, and I11 are
the lenient ones; and I02, I07, I08 and I09 are in between. The intermediate group is
remarkably homogeneous.


Graphic 2. Quality judgment, per subject.

2.1.2. Evenness
Graphic 3 displays each evaluator's average quality judgments per set. I05 seems
especially tough in set A (1.08 average) when compared to the overall average, and I02 is very
generous in set D (2.92). On the other hand, I08 is very regular throughout the sets (2.25
average), followed by I11, I04 and I03, who have better general opinions on the
translations. Clearly, lenient evaluators (plus medium evaluator I08) seem more even
than the rest in all texts.

Graphic 3. Set averages of subjects' quality judgments.

2.2. Actions
2.2.1. Quantity
The number of actions correlates significantly with quality judgments (-0.535) when
considered text by text, but not when analyzed by subjects. The total number of actions
is 11909 (table 3). Within sets, C and D show the largest variations, where one
translation may receive up to four times as many actions as another.

text / set   A                 B                 C                 D
             aver.    s.d.     aver.    s.d.     aver.    s.d.     aver.    s.d.
01           44.90    21.702   24.00    9.684    27.00    15.727   8.80     5.181
02           37.10    15.366   18.00    5.249    18.20    6.374    15.80    8.664
03           56.80    20.471   16.30    4.347    22.60    8.579    22.70    7.675
04           56.70    25.975   19.90    6.226    29.60    11.138   26.00    5.598
05           42.40    20.250   16.90    4.701    19.80    8.053    10.40    5.296
06           62.70    24.784   22.10    6.557    14.50    10.157   12.50    8.100
07           32.90    15.871   14.90    7.534    24.10    10.682   12.00    6.716
08           47.60    20.007   17.60    8.072    23.40    19.945   16.80    8.257
09           63.30    18.331   18.30    8.433    13.10    7.370    11.70    4.877
10           30.10    17.272   14.60    8.605    13.50    7.706    17.50    7.634
11           37.20    17.561   15.50    7.200    14.50    8.567    17.70    6.701
12           35.30    15.151   17.30    8.629    22.90    9.597    13.40    5.621
set aver.    45.58    19.4     17.95    7.103    20.27    10.32    15.44    6.693
Table 3. Quantity of actions, per translation.
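The kind of text-by-text relationship between actions and quality judgments reported above can be illustrated with a plain Pearson coefficient. The sketch below uses only the set A averages from Tables 1 and 3, so it is not expected to reproduce the -0.535 figure, which was obtained in SPSS over the whole task; it only shows the direction of the relationship on a subset of the published averages.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Set A only: average quality judgments (Table 1) and average number of
# actions (Table 3) for translations A01 to A12.
quality_a = [2.1, 2.2, 1.2, 1.5, 1.6, 1.1, 2.1, 1.6, 1.8, 2.9, 2.1, 2.4]
actions_a = [44.90, 37.10, 56.80, 56.70, 42.40, 62.70,
             32.90, 47.60, 63.30, 30.10, 37.20, 35.30]

r = pearson(quality_a, actions_a)
print(round(r, 3))
```

A negative coefficient means that translations receiving more actions tend to receive lower quality judgments.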

Subjects differ widely in the number of actions they carry out (table 4). Subject I02
performed only 627 actions in total, whereas I10 reached 1877, approximately three times as many.

set / subject           I02     I03     I04     I05     I06     I07     I08     I09     I10     I11     aver.
average     A           23.08   59      42.5    62.5    37.58   28.42   26.33   47.83   72.67   55.92   45.58
            B           8.91    24.92   15.5    30.83   15.83   14      14.67   13.33   30.17   34.5    20.27
            C           13.75   19.25   20.67   14.67   19.92   11.17   15.58   12.08   25.17   27.25   17.95
            D           6.5     16.33   13.58   16.33   16.5    11.33   16.25   11.83   28.42   17.33   15.44
            Total       13      30      23      31      22      16      18      21      39      34
Nr. of actions          627     1434    1107    1492    1078    779     874     1021    1877    1620
Table 4. Quantity of actions, per subject.


Graphic 4. Set averages for subjects' quantity of actions.

Graphic 4 displays the average quantity of actions that each subject performed for every
set, ordered from left to right in decreasing total quantity of actions. Curiously, four out
of the five subjects who performed the most actions were the lenient evaluators, and
medium evaluators performed fewer actions than demanding ones.

2.2.2. Types
Marking is the only type of action which correlates significantly at 0.01 with quality
judgments (so do changes at the margin, but they are very few). Evaluators I03, I04,
I07, I09 and I11 tended to act on the text adding, suppressing, and changing text. On the
opposite pole, subjects I02 and I08 were clearly oriented to offer feedback to the
translator or the researcher, whereas the rest did not seem to have a clear pattern of
behavior.

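actions / subjects      I02    I03    I04    I05    I06    I07    I08    I09    I10    I11    total
Classification          612    -      -      -      -      -      821    -      747    -      2180
Mark                    -      -      -      568    636    142    3      2      127    39     1517
margin   Addition       -      -      -      †      †      -      -      -      -      -      1
         Note           -      -      22     -      -      -      -      -      -      -      22
         Change         -      -      -      54     104    -      -      -      -      -      158
in text  Addition       2      288    113    150    32     125    -      99     172    175    1156
         Suppression    -      141    134    85     47     65     -      90     87     206    855
         Change         -      962    838    620    212    401    -      809    727    1151   5720
         Note           2      28     -      14     46     46     50     21     17     49     273
Doubtful                11     15     -      †      †      -      -      -      -      -      27
Total                   627    1434   1107   1492   1078   779    874    1021   1877   1620   11909
† The single margin addition and the single remaining doubtful action belong to I05 and I06, one each; the totals do not show which subject performed which.
Table 5. Types of actions, per subject.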

Actions co-occur in certain patterns. Adding in text strongly correlates with changing in
text (0.896), suppressing (0.864) and marking (0.721). Other correlations show
emergent profiles of coherent behavior: evaluators generally tend either to
try to fix the translations for later use (text-oriented), or else
to explain the sense of their actions to the translator or the
researcher (feedback-oriented). Graphic 5 shows the distribution of the five most
common actions in the subjects. Subjects I02 and I08 only classify phenomena, whereas
I03, I04 and I09 focus on changing, adding, and suppressing in the body of texts. In any
case, lenient evaluators seem to focus on introducing changes in the translations,
whereas demanding evaluators tend to mark phenomena more often.


Graphic 5. Types of actions, per subject.

When contrasted to their level of demand, demanding evaluators turn out to prefer to
just mark phenomena, medium evaluators perform more classifications, and lenient
evaluators introduce more changes and suppressions.
As for comments, no clear pattern emerged from their use. It is worth noting,
however, that I05, one of the more demanding evaluators, was the subject who made
the most of them (37.5% of all), followed by I10 (14.4%). On the other hand, the subjects
who introduced the fewest comments were I04 (1.4%) and I03 (2.8%), the two most lenient
subjects.

2.3. Summary of subjects' profiles
Evaluators show consistent tendencies to adopt (1) a higher or lower level of demand,
and (2) to confront different texts with a higher or lower degree of evenness. Their
actions on the translations may be (3) more or less abundant; and (4) text-oriented or
feedback-oriented; and (5) supported with a few or many comments.


        demand          actions
        Level   Even    Quant   Type    Comm
I02     2       1       1       1       3
I03     1       2       3       4       1
I04     1       2       2       4       1
I05     3       1       3       3       4
I06     3       1       2       2       2
I07     2       1       1       3       3
I08     2       2       1       1       1
I09     2       1       2       4       1
I10     1       2       3       1       3
I11     1       2       3       3       2
Table 6. Summary of subjects' characteristics.

Table 6 displays a summary of criteria, where evaluators have been grouped according
to their results. Column I displays the level of demand, from the most lenient (1) to the
most demanding (3). Column II displays the level of evenness, from the most uneven (1)
to the most even evaluator (2). Column III displays the order of subjects according to
the quantity of actions on texts, from the fewest (1), to the most abundant (3). Column
IV ranks subjects from the most feedback-oriented (1) to the most text-oriented (4).
Finally, column V ranks subjects according to the number of comments they introduced,
from the fewest (1) to the most abundant (4). In brief, demanding evaluators tend to be
feedback-oriented, and are uneven in their level of demand. Medium evaluators tend to
perform few actions and are also pretty uneven. I02 and I07 behave similarly. Lenient
evaluators seem more homogeneous: they are text-oriented, perform many actions when
evaluating, and tend to be pretty even in their judgments. I03 and I04 seem particularly
close in the way they evaluate.
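The closeness between profiles in Table 6 can be made explicit with a simple distance between rows. The sketch below uses the Manhattan distance over the five columns; this measure is an assumption chosen for illustration, not a procedure used in the study.

```python
from itertools import combinations

# Profiles from Table 6: (Level, Even, Quant, Type, Comm).
profiles = {
    "I02": (2, 1, 1, 1, 3),
    "I03": (1, 2, 3, 4, 1),
    "I04": (1, 2, 2, 4, 1),
    "I05": (3, 1, 3, 3, 4),
    "I06": (3, 1, 2, 2, 2),
    "I07": (2, 1, 1, 3, 3),
    "I08": (2, 2, 1, 1, 1),
    "I09": (2, 1, 2, 4, 1),
    "I10": (1, 2, 3, 1, 3),
    "I11": (1, 2, 3, 3, 2),
}


def manhattan(a, b):
    """Sum of absolute differences between two profiles (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_pair(table):
    """Return the pair of subjects whose profiles are nearest to each other."""
    return min(
        combinations(sorted(table), 2),
        key=lambda pair: manhattan(table[pair[0]], table[pair[1]]),
    )


print(closest_pair(profiles))
print(manhattan(profiles["I02"], profiles["I07"]))
```

Since the columns use different scales (two, three, or four levels), an unweighted sum gives more influence to the columns with more levels; a normalized distance would be a reasonable alternative.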
Of course, 10 evaluators are far too few for the data to support any firm
conclusions, but they are interesting from two perspectives: First, they point to possible
tendencies and relationships between variables when analyzing the behavior of
evaluators; hence, this research strategy seems promising and deserves further research.
Second, the variation in evaluators' behaviors is the background against which to contrast the
regularities found in all of them. These regularities can be explained as order effects in
serial translation evaluation.

3. Order effects

3.1. Order effects in the whole task
Graphic 4 showed that the number of actions decreases dramatically from set A to the
rest in all subjects. This is the first and most obvious order effect, and might be due to
the lack of experience of the students as evaluators. They may have started by performing many
actions, only to realize progressively that it meant too much work or else that it was not
necessary to perform so many actions to carry out the task.
As for the type of actions, graphic 6 shows that whereas the number of changes,
additions and suppressions decreases between set A and set D, classifications are the
only type of action that seems to increase. This supports the notion that decreasing
actions by subjects might be due to their adjusting the effort to the task. In fact,
classifications increase because one of the subjects changed her strategy: she stopped
changing and started classifying in the middle of the task.



Graphic 6. Type of actions in different sets.


Graphic 7 shows that the percentage of salient phenomena (>5), that is, phenomena singled out by more
than five evaluators, stays around 20% in sets A, C, and D. The original in set B was
the first technical translation and students were not familiar with the subject matter.
This might explain the drop in coincidences. The relative increase in normalized
phenomena within salient phenomena probably indicates that evaluators felt
uncomfortable with the text. In any case, it is worth pointing out that normalized
phenomena only account for ca. 5% of all actions.





Graphic 7. Percentage of salient phenomena (>5) in each set.



3.2. Order effects within sets
Table 7 shows the amount of actions in the three subsets. There is a general tendency to
reduce the quantity of actions per subset, which may be due to an improvement in
efficiency. Again, this supports the notion of the evaluators learning how to carry out
the task as they were doing it. The exception is set D, where subset III has more actions
than subset II, but it also has a lower quality judgment average.


Subsets/Sets   A      B     C     D     Subset total
I              1955   974   782   733   4444
II             1856   818   715   517   3906
III            1659   640   657   603   3559
Table 7. Number of actions per subset.
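The figures in Table 7 can be cross-checked against the totals and set averages reported earlier. The sketch below assumes that every set comprises 120 evaluations (12 translations by 10 subjects); that divisor is an assumption made here for the check, not a figure stated in the text.

```python
# Number of actions per subset (I, II, III) in each set, from Table 7.
actions = {
    "A": [1955, 1856, 1659],
    "B": [974, 818, 640],
    "C": [782, 715, 657],
    "D": [733, 517, 603],
}

# Assumed number of evaluations per set: 12 translations x 10 subjects.
EVALUATIONS_PER_SET = 12 * 10

# Last column of Table 7: sum across sets for each subset.
subset_totals = [sum(actions[s][i] for s in actions) for i in range(3)]

# Total and average number of actions per set.
set_totals = {s: sum(values) for s, values in actions.items()}
set_averages = {s: total / EVALUATIONS_PER_SET for s, total in set_totals.items()}

grand_total = sum(set_totals.values())

print(subset_totals, grand_total)
print({s: round(avg, 2) for s, avg in set_averages.items()})
```

The last column of the table is the sum across the four sets, and the grand total matches the total number of actions reported for the task.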


While there is a tendency for most types of action to appear less in subsets II and III
across sets, suppressions increase in sets A, B, and D; additions and changes, in set D;
and classifications in sets C and D. The increase of suppressions throughout three sets
may be taken to indicate that evaluators have a clearer notion of the relevance of the
information. The rise of classifications throughout sets C and D may be thought of as a
consequence of the evaluators getting tired of repeating the same action.




Graphic 8. Percentage of salient (>5) phenomena, in each subset.

Graphic 8 shows a steady increase in the percentage of salient phenomena across the
three subsets, probably an indicator of the degree of certainty in the evaluators. The
drop in normalized phenomena in subset II might be explained as a function of the
degree of confidence of the evaluators in the task.

3.3. Order effects within the texts
Table 8 shows the relationship between number of actions in different translation
sections and quality judgments. All section correlations are statistically significant, but the closer to
the end, the stronger the correlation. The tendency to increasing significance is evident
at sentence level, since actions in the first sentences in the translations do not correlate
with quality judgments. The relationship between quality judgments and actions in
translation text segments which received a special typographic treatment or else stood
out due to their position in the text, such as titles, headings, captions and the like,
showed a lower significance than regular segments. Hence, visual prominence was ruled
out as an explanation for first and last sentence correlations.


Translations              Pearson     Sig. (2-tailed)
Segment    outstanding    -0.324*     0.025
           regular        -0.525**    0.000
Sentence   first          -0.057      0.701
           last           -0.514**    0.000
           rest           -0.522**    0.000
Section    initial        -0.411**    0.004
           central        -0.548**    0.000
           final          -0.597**    0.000
* Correlation significant at 0.05
** Correlation significant at 0.01
Table 8. Relationship between quality judgment and actions.


Hence, evaluators seem to identify phenomena and perform actions on them in all
sections of the translations, but the further down in the text, the stronger the effect on
their judgment of the quality of the translation as a whole. Interestingly, this does not
correspond to the percentage of salient phenomena, which drops in central sections.


Graphic 9. Percentage of salient phenomena (>5) in initial, central, and
final sections of translations.

Quality judgments are independent of the quantity of actions introduced, when
considered by subject. Lenient evaluators do perform more actions than demanding
evaluators, and medium evaluators perform the fewest, as shown in graphic 10.


Graphic 10. Quantity of actions in initial, central, and final sections by
lenient, medium, and demanding evaluators.




Another interesting effect can be traced when the number of actions in translations'
sections (graphic 11), that is, their initial, middle, and final parts, is correlated to
average quality judgments. Bad and Good translations show a similar pattern of
subjects' behavior, where initial sections contain a number of actions which slightly
decreases in central sections to minimally rise again in final sections. Very bad
translations, however, show a steady increase in the number of actions across sections,
and Very good translations present a constant decrease in the number of actions as the
text progresses. This might point to an emotional involvement of evaluators in the
process.



Graphic 11. Quantity of actions in initial, central, and final sections of
translations, per average quality judgment.
