David Coniam
ABSTRACT
KEYWORDS
INTRODUCTION
While computerized testing has been receiving attention in the 1980s and 1990s due to
the increased availability and power of computers, computerization has essentially been
confined to two areas:
• The ease of marking and immediate delivery of scores via multiple-choice and
nth-deletion exact-scoring cloze tests has inevitably resulted in the development of
many such tests in computerized form.
• The establishment of tests and item banks calibrated by Item Response Theory
techniques to determine subjects’ ability through a minimum of test items has been an
area of ample recent research and development (Hambleton 1989; Carlson 1994;
Meunier 1994).
...something that could not be done equally by other means (e.g. by pen
and paper).
In a subsequent 1991 paper, Alderson notes, however, that surprisingly little has been
done on the development of computer-based tests:
It might be argued that the innovative possibilities which CBELT offers have been
realized in CALT programs, where subjects can be awarded a score on the basis of
many fewer items than are necessary with traditional modes of assessment, where
subjects work sequentially through an entire test. See Smittle (1991), for example, for a
description of a CALT test of reading. Since 1993 it has been possible to take the
Graduate Record Examination for entry to graduate school in the US in this way. And
in Singapore, where the University of Cambridge Local Examinations Syndicate is
experimenting with self-access testing by telephone, it is possible to dial in and take a
test by modem.
As Meunier rightly observes, however, the focus of CALT tests too often centers around
the testing of grammar via the multiple-choice or cloze formats, since open-ended
responses cannot be dealt with in CALT tests (1994, 37). The tests described above, for
example, focus on reading and general language proficiency via such restricted formats.
Further, the testing of the listening skill is an issue which has received much less
attention in terms of self-access testing (see, however, Mydlarski and Paramskas 1985;
Ariew and Dunkel 1989; Dunkel 1991). One of the reasons for this lack of attention is
self-evident: the testing of listening by conventional means necessitates playing an
audio tape—usually in a classroom or a school hall. For reasons of security, listening
tests cannot be repeated over and over again as tests of reading can.
The testing of listening in this paper focuses on Hong Kong secondary schools, with
which the current author has had considerable contact in terms of the assessment that
takes place there. Current testing of listening involves large groups of subjects taking
the same test at one time, a factor which places severe operational constraints on
running listening tests. For reasons of face validity, Hong Kong secondary schools feel
that a different test has to be prepared and administered each time, so as to prevent
students gaining an advantage by having heard the test before.
Oller (1979) suggests that the partial dictation concept—where all of the test material is
presented in auditory form, but only part of the test material is presented in written
form—is a valid pragmatic testing measure because it requires subjects to interpret
what they hear as part of natural spoken discourse, and hence subjects’ global language
proficiency can be tapped (1979, 266). While the concept of a single factor to account for
second language proficiency is disputed by certain researchers (e.g., Bachman and
Palmer 1982), the work of Fouly (1985) and Fouly and Cziko (1985) in dictation would,
however, appear to provide evidence supporting dictation as a valid tool for sampling
second language proficiency, and listening proficiency in particular.
One of Oller’s proposed marking schemes for dictation is that one mark should be
deducted for each error, as long as the errors are not simply spelling errors (1979, 282).
The examples Oller provides of misspellings for which no mark would be deducted
(“poisened” for “poisoned”; “repeate” for “repeat”) indicate that a substantial amount
of judgement is involved (1979, 289). It has not been possible to incorporate
“judgement” as a factor in the current computerized dictation test: the marking scheme
essentially consists of marking a word as either correct or incorrect on the basis of
grammar and spelling. As is discussed in the Results section below, in the current
dictation, very little misspelling actually occurred, suggesting that the necessity for
correctly-spelt words does not invalidate the testing procedure.
TEST DESIGN
After subjects have heard an utterance, they are given time to type it in. (Utterances
need to be typed in exactly, although, as it is a listening test, punctuation and
capitalization are not scored.) The utterance is then parsed for accuracy. The current
author experimented with a number of methods of marking; the one which produced
the most consistent results, and which the program utilizes in this study, is an adaptation of
that proposed by Oller (1979). The marking procedure involves the following:
• one mark is awarded for a correct word in the correct place in the utterance
• 0.5 of a mark is awarded for a correct word appearing elsewhere in the utterance
A simple calculation then converts the total mark for an item to a percentage in terms of
the number of words in an utterance. Fouly and Cziko (1985) recommend the procedure
of awarding 1.0 marks for a totally correct answer and zero for an answer which
contains any incorrect words. While this is a simpler operation in terms of marking,
such a marking scheme leaves little room for variability within subjects’ answers.
To put the current marking system into perspective, an example will now be given.
Consider the second exchange which subjects see and hear (the second speaker’s
utterance—which is to be typed in—is in italics.):
With the scoring system set at 1.0 marks for a correctly-spelt word in the correct
place, and 0.5 of a mark for a correct word somewhere in the utterance, Table 1 below
presents two subjects’ responses.
Subject 1 input:  i     started  learning  at   about  ten
          score:  1.0   1.0      1.0       0    0      0.5              Total: 3.5/7 = 50%

Subject 2 input:  i     start  for  learning  about  i     was   ten
          score:  1.0   0      0    0.5       0      0.5   0.5   0.5    Total: 3/7 = 43%
Table 1. Two subjects’ responses for the utterance “I started learning when I was ten”
On the above item, Subject 1 has scored higher than Subject 2. While we would agree
that this is correct, in that Subject 1 appears to have understood more than Subject 2,
the score differential does not really reflect this. To what extent this highlights problems
with current testing procedure is discussed below where correlations between the
computerized dictation and a conventional pen-and-paper listening test are presented.
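As a minimal sketch, the marking scheme just illustrated can be expressed in code. The function name and the handling of duplicate words are assumptions; the original parsing routine is not reproduced here:

```python
def score_utterance(target, response):
    """Score a typed response against the target utterance:
    1.0 for a correctly-spelt word in the correct position,
    0.5 for a correct word appearing elsewhere in the utterance.
    Case is ignored, since punctuation/capitalization are not scored.
    """
    target_words = target.lower().split()
    response_words = response.lower().split()
    remaining = list(target_words)  # target words not yet matched
    score = 0.0
    for pos, word in enumerate(response_words):
        if pos < len(target_words) and word == target_words[pos]:
            score += 1.0              # right word, right place
            if word in remaining:
                remaining.remove(word)
        elif word in remaining:
            score += 0.5              # right word, wrong place
            remaining.remove(word)
    return score, len(target_words)

# The two responses from Table 1:
target = "I started learning when I was ten"
s1, n = score_utterance(target, "i started learning at about ten")
s2, _ = score_utterance(target, "i start for learning about i was ten")
```

Run on the two responses in Table 1, this reproduces the 3.5/7 (50%) and 3/7 (43%) totals.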
The nine exchanges of the dialogue were constructed so that, in line with the work of
Fouly and Cziko (1985) on the scalability of difficulty of dictation items, earlier test
items were short, with a gradual increase in the number of words tested in each item.
While there is a progression of difficulty through the utterances, this progression was
not mathematically enforced, as this would have made for a rather unnatural-sounding
dialogue.
It can be seen that, while there is not an absolute increase in the number of words in
each utterance, the tendency is for listening demands to increase through the course of
the dialogue. While word frequency is no guarantee of the accessibility of a text, there is
evidence that a strong relationship exists between control over the most frequent words
and language proficiency (Harlech-Jones 1983; Meara and Jones 1988; Sinclair 1991).
[Table 2: percentage of all English text accounted for by the n most frequent words, with
word counts for Speaker A and Speaker B]
Secondary 7 students would be expected to know all the words in the dialogue, since all
the words fall within the 90% most frequent words—with the vast majority occurring
within the 2,145 (the top 80%) most frequent. Given this, the test can therefore be
viewed, as Oller (1979) argues, as a test of interpreting in context, rather than simply
accuracy in terms of spelling. Conversely, if students are not familiar with many of the
words, they also have to guess from the limited context, which affects the validity of the
exercise. Eight words fell outside the top 80%. These were:
Allied to the fact that the majority of the words in the dialogue were known entities is
the fact that for Hong Kong students generally spelling is not a major problem. The
stipulation of the rule that words must be spelt correctly is therefore an acceptable
constraint.2
A time limit of five seconds for each word in an utterance was set, so that the test
would terminate if a student was extremely indecisive or slow to respond. With test
item 9, for example, which was 12 words long, subjects had 60 seconds in which to enter
their answer.
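The timing rule can be sketched in a line or two (the function name is invented for illustration):

```python
def time_limit_seconds(utterance, seconds_per_word=5):
    """Five seconds per word in the utterance to be typed in."""
    return seconds_per_word * len(utterance.split())
```

For the 12-word item 9, this yields the 60 seconds mentioned above.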
SUBJECTS
The subjects who took the test were the 28 students in a Secondary 7 class in a local
Hong Kong secondary school. Prior to taking the computer listening test, they
attempted a number of short multiple-choice items which had been calibrated against a
representative sample of the Hong Kong cohort of secondary school students (Coniam
1995). The results of these items placed the general language proficiency of the students
between Secondary 5 and Secondary 6. These test results are purely for the reader’s
reference, so that the subjects’ ability can be placed within the Hong Kong context: the
relationship between listening and general proficiency is not a concern of the current
study.
The subjects constituted a general Secondary 7 class, who were not particularly
computer-literate. To forestall any possible unfamiliarity, they were therefore given a
few minutes to familiarize themselves with the notebook computer and its keyboard.
Subjects took the test via headphones at the back of the class while the teacher
conducted a normal lesson. While these conditions are far from ideal, they underscore the
fact that a listening test can be conducted on an unobtrusive, self-access basis.
RESULTS
As mentioned above, subjects were allowed five seconds for each word in a particular
utterance. Subjects did not report any instances, however, of having had insufficient
time, so the time limit would appear to be acceptable.
For a proficiency test, the optimum mean for the test is in the region of 50% (Gronlund
1985, 253). Such a mean suggests—as an initial indicator—that the test is neither too
easy nor too difficult for the subjects and roughly matches subjects’ level of proficiency.
The results for the test are presented in Table 4 below. The overall mean for the test was
43%—slightly lower than the optimum of 50%—and reflects subjects’ general lower
ability, as indicated by the group’s performance on the short items. The standard
deviation for the test was 11.1%, which is very similar to that obtained by the HKEA on
the Use of English examination, and indicates that the group are fairly homogeneous in
terms of ability.
The issue of dictation being simply a disguised test of spelling was alluded to above.
While there were incidences of misspelling—some examples of which are presented in
Table 5—an examination of subjects’ output provided little evidence of incorrectly-spelt
words which distorted the scoring.
In general, there was very little misspelling, with only a few incidences of very
high-frequency words being misspelled (“wood” or “woudl” for “would,” etc.).
Interestingly, in the case of the word “foreigners,” six of the 28 subjects typed in “stranger” or
“strangers”—underscoring the fact that the listening test is sampling context as much as
straightforward accuracy, and that subjects did in fact comprehend the test and the task.
The fact that the test can be administered and marked on a self-access basis suggests
that, in terms of a computer listening test as a diagnostic device for schools, the test may
well have potential. With developments in computer technology, and the fact that all
schools now have quite large computer labs, it may well be possible to run the same
test independently on a number of computers.
In this study, subjects have been awarded one mark for a correct word in the correct
position in the utterance and 0.5 of a mark for the correct word appearing somewhere
in the utterance. One minor problem here is with inflected words, where a subject has
missed off an “s” or ”ed” ending. For example, the first exchange of the dialogue is as
follows:
If subjects enter “for about six year,” they score zero for the word “year.” One
amendment to the scoring procedure with which the author has been experimenting
involves comparing subjects’ input on words where common inflections may have been
omitted against the Bank of English list of most frequent words, and awarding a score
of 0.25 for such input. It appears, however, that such an amendment makes only a
minor difference to subjects’ overall scores and to overall results such as test mean and
concurrent validity. Further, it may be argued that such an amendment places a tighter
focus on word-for-word accuracy and possibly reduces the listening in context which
the test is sampling.
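That experimental amendment might look something like the following sketch. The suffix list, the function name, and the set standing in for the Bank of English frequency list are all assumptions:

```python
# Common inflectional endings a subject might omit ("year" for "years")
COMMON_SUFFIXES = ("s", "es", "ed", "ing")

def inflection_credit(target_word, response_word, frequent_words):
    """Award 0.25 when the response word differs from the target word
    only by a common inflectional ending, provided the response word
    itself appears in the frequency list (a stand-in for the Bank of
    English list of most frequent words)."""
    if response_word not in frequent_words:
        return 0.0
    for suffix in COMMON_SUFFIXES:
        if target_word == response_word + suffix:
            return 0.25
    return 0.0
```

Under this rule, a subject entering “for about six year” against “for about six years” would score 0.25 rather than zero for “year.”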
At present, to give sufficient surrounding context, subjects see what Speaker A says, but
only hear what Speaker B says. Another option with which the current author has been
experimenting and which would improve the validity of the listening test is for subjects
to hear both speakers only, without the support of Speaker A’s words on the screen. This
is, however, an option which needs to be field-tested for its viability.
Given that the sample of one class of 28 students on which the test has been run is
rather limited, conclusions can only be tentative. However, the fact that the computer
listening test and the Use of English mock listening examination correlated significantly
at 0.46 (p= 0.01) suggests that the concept of the computer listening test is one which is
worth pursuing.
NOTES
1. The figures of the number of words and the amount of English text for which they
account were obtained from a frequency list derived in February 1996 from the
211-million-word Cobuild Bank of English corpus.
2. The HKEA recognized this fact in the criterion-referenced marking scheme for the
pre-1980 HKCE composition paper, where a maximum of only 5 marks could be
deducted for incorrect spelling.
3. 30% is generally regarded as the cutoff point in terms of item difficulty; see Falvey,
Holbrook and Coniam 1994, 119ff for an elaboration.
REFERENCES
Alderson, J. (1991). “Language Testing in the 1990s: How Far Have We Come? How
Much Further Have We to Go?” Current Developments in Language Testing, edited
by S. Anivan. SEAMEO, Regional Language Centre.
Bachman, L., and A. Palmer (1982). “The Construct Validation of some Components of
Communicative Competence.” TESOL Quarterly 16, 449-465.
Coniam, D. (1995). “Towards a Common Ability Scale for Hong Kong English
Secondary School Forms.” Language Testing 12, 2, 184-195.
Falvey, P., J. Holbrook and D. Coniam (1994). Assessing Students. Hong Kong: Longman.
Fouly, K. (1985). A Multivariate Study of the Nature of Language Proficiency and its
Relationship to Learner Traits: A Confirmatory Approach. Unpublished Ph.D.
thesis. University of Illinois at Urbana-Champaign.
______., and G. Cziko (1985). “Determining the Reliability, Validity and Scalability of
the Graduated Dictation Test.” Language Learning 35, 5, 555-566.
Harlech-Jones, B. (1983). “ESL Proficiency and a Word Frequency Count.” ELT Journal
37, 1, 62-70.
Meara, P., and G. Jones (1988). “Vocabulary Size as a Placement Indicator.” Applied
Linguistics in Society: Papers from the 20th Annual Meeting of the British Association for
Applied Linguistics, edited by Pamela Grunwell. Nottingham, September 1987.
Mydlarski, D., and D. Paramskas (1985). “Template System for Second Language Aural
Comprehension.” CALICO Journal 3, 8-12.
Perkins, K., S. Brutton and P. Angelis (1986). “Derivational Complexity and Item
Difficulty in a Sentence Repetition Task.” Language Learning 36, 2, 125-141.
Wise, S., and B. Plake (1989). “Research on the Effects of Administering Tests via
Computers.” Educational Measurement: Issues and Practice 8, 3, 5-10.
I would like to thank Cobuild of the University of Birmingham for access to the Bank
of English corpus.
APPENDIX: THE TEST DIALOGUE

Alan: Hello Brian, can I ask you a few questions about your English?
Brian: Sure what would you like to know
[done as an example]
Alan: How long have you been studying English?
Brian: For about six years.
Alan: When did you begin?
Brian: I started learning when I was ten
Alan: Well your English is pretty good.
Brian: Come on I make quite a lot of mistakes.
Alan: Mm, not really; are you still studying English now?
Brian: yes I take courses at night
Alan: How often do you have classes?
Brian: Twice a week usually
Alan: And how do you find learning English?
Brian: Its difficult but I enjoy it
Alan: And do you study on your own at all?
Brian: Yes I listen to the radio every evening
Alan: What about English books?
Brian: I read a lot of English books but I prefer speaking
Alan: So who do you speak to in English?
Brian: I try and talk to foreigners on the street when I can
Alan: Good stuff—keep it up!
AUTHOR’S BIODATA
AUTHOR’S ADDRESS
Faculty of Education
Chinese University of Hong Kong
Sha Tin, Hong Kong
Phone: 852 2609 6917
Fax: 852 2603 6129
E-mail: coniam@cuhk.edu.hk