
Language Testing

http://ltj.sagepub.com Scores on a yes-no vocabulary test: correction for guessing and response style
Ineke Huibregtse, Wilfried Admiraal and Paul Meara Language Testing 2002; 19; 227 DOI: 10.1191/0265532202lt229oa The online version of this article can be found at: http://ltj.sagepub.com/cgi/content/abstract/19/3/227

Published by:
http://www.sagepublications.com


Downloaded from http://ltj.sagepub.com by Tomislav Bunjevac on September 9, 2009

Scores on a yes-no vocabulary test: correction for guessing and response style
Ineke Huibregtse, Wilfried Admiraal Utrecht University, The Netherlands and Paul Meara University of Wales Swansea, United Kingdom

The use of yes-no tests seems to be a promising method for measuring the size of receptive vocabulary knowledge of learners of a foreign language. Items in a yes-no test each consist of either a word or a pseudoword. Participants are asked to indicate whether or not they know the meaning of these words. This article attempts to tackle the problem of determining a meaningful score for this type of test. Such a score should contain correction for guessing as well as for participants' response style. Three possible methods are discussed, but none of these measures appears to apply this type of correction. Signal Detection Theory is applied and a new, more accurate index is suggested. Based on theoretical as well as empirical considerations, recommendations are made about the choice of the index to be used in a yes-no vocabulary test.

I Introduction: measuring vocabulary size by means of a yes-no test

Conceptions of second language acquisition theory and second language education change over time. Accordingly, the perceived importance of vocabulary knowledge varies considerably. In spite of this, some researchers who are especially interested in second language vocabulary have continued to develop methods for vocabulary measurement (for an overview, see Read, 1997). One of the central issues in second language vocabulary measurement is the estimation of the vocabulary size of language learners. In the case of receptive word knowledge, a possible way to do this is by means of a yes-no test. A typical yes-no test consists of two different kinds of items: real words and pseudowords. Pseudowords are words that fulfil the phonological constraints of the language but do not bear meaning. The items of a yes-no test each consist of one word and are presented visually.
Address for correspondence: Ineke Huibregtse, IVLOS Institute of Education, Utrecht University, PO Box 80127, 3508 TC Utrecht, The Netherlands; email: i.huibregtse@ivlos.uu.nl
Language Testing 2002 19 (3) 227-245

10.1191/0265532202lt229oa 2002 Arnold


Learners are asked to indicate whether or not they know the meaning of the word and to answer with 'yes' or 'no'. Participants know that the test contains non-existing words, but not how many there are or where they are located in the test. Although there are disadvantages to this test format (see, for example, Chapelle, 1998), yes-no tests also have some advantages regarding practicality. Completing a yes-no test requires little time per item. It is therefore possible to use a large number of items, which permits the construction of a reliable measurement of vocabulary size. Probably this practical advantage encourages researchers to use this type of test. A thorough discussion about the advantages and disadvantages of a yes-no test lies beyond the scope of this article. The focus of this article is a discussion about the calculation of a score on a yes-no test on the basis of the responses of testees. Because of the nature of the test it is not obvious how to determine the test score. Just counting the number of correct answers seems to be too simple. Since there are two types of items and two possible responses ('yes' and 'no'), there are four possible stimulus-response combinations. Figure 1 reveals that there are two types of correct responses: 'yes' in the case of a real word (hit) and 'no' if the item
                        Response alternative
Stimulus alternative    Y            N
w                       P(Y|w)       P(N|w)
p                       P(Y|p)       P(N|p)

Figure 1 Stimulus-response matrix
Notes: The stimulus alternatives are w for word and p for pseudoword. The response alternatives are yes (Y) and no (N).

contains a pseudoword (correct rejection). Incorrect responses are 'no' to a real word (miss) and 'yes' in the case of a pseudoword (false alarm).

An example of a yes-no test that measures vocabulary knowledge is the 'English as a Foreign Language Vocabulary Test' (EFL vocabulary test), developed by Meara (1992a) for estimation of the receptive English word knowledge of foreign language learners. The EFL vocabulary test consists of a number of tests on six different levels that are based on word frequency in written texts (Hindmarsh, 1982; Nation, 1986). The first test level is based on the 1000 most frequently occurring English words, the second level on the next 1000 words, etc. For each frequency range of 1000 words there are 10 parallel tests, each of which consists of 60 items: 40 real words (a random selection from the frequency range concerned) and 20 pseudowords. The pseudowords consist of syllables of words from the frequency range involved that are put together at random. The resulting pseudowords are judged by native speakers on their consistency with the phonological rules of English. Pseudowords that appeared to meet the phonological rules are included in the test, taking into account the number of syllables.

As the EFL vocabulary test is used in various studies, it is important to find an accurate way to estimate a meaningful score on the basis of the participants' responses. For the EFL vocabulary test, Meara (1992b) suggests a score (Δm) based on Signal Detection Theory (SDT). The accuracy of the measure Δm is under discussion because of two salient characteristics. First, for moderate performance, the score very rapidly approaches the value 0, even if the performance is well above chance level. Secondly, small differences in response behaviour may cause large differences in scores, in particular when the number of 'yes' responses to real words is small.
For example, if half of the real word items and none of the pseudowords are answered with 'yes', the value of Δm is 0.50. However, if instead the participant said 'yes' to one pseudoword item, then the value of the score drops to 0.37. The difference between these two values amounts to more than 10% of the total score range, which seems to be disproportionate to the actual difference in response behaviour, namely, one false alarm. Indeed, the observed properties of Δm suggest the need to explore the possibility of finding a better way to determine a score on the EFL vocabulary test. In this article four possible methods of calculating the score on the EFL vocabulary test are discussed. The present study aims to consider which of these methods most adequately estimates the score on the EFL vocabulary test.
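The two values quoted in this example can be checked numerically. The sketch below assumes the form of Meara's index that appears later in this article as equation (6), together with the item counts of the EFL vocabulary test (40 real words, 20 pseudowords). The function name `delta_m` is illustrative only and does not come from the article.

```python
def delta_m(h: float, f: float) -> float:
    """Meara's index as written in equation (6) of the article:
    (h - f)(1 + h - f) / (h(1 - f)) - 1, with hit rate h and false alarm rate f.
    Assumes h > 0 and f < 1, so that the denominator is non-zero."""
    return (h - f) * (1 + h - f) / (h * (1 - f)) - 1


# Half of the 40 real words answered 'yes', no false alarms among 20 pseudowords.
no_false_alarm = delta_m(h=20 / 40, f=0 / 20)
# Same hit rate, but one 'yes' to a pseudoword.
one_false_alarm = delta_m(h=20 / 40, f=1 / 20)

print(round(no_false_alarm, 2))   # 0.5
print(round(one_false_alarm, 2))  # 0.37
```

A single false alarm thus lowers the score by about 0.13 on a scale from 0 to 1.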

II Calculating the score on the EFL vocabulary test

Calculation of the score on a yes-no test is first determined by the nature of the test. A yes-no test includes two types of correct answers and two types of incorrect answers that can have different implications for measuring a participant's word knowledge. Also, variables other than word knowledge that might influence the participants' responses need to be considered. Two variables are particularly important. The first is 'guessing'. For each item participants choose from two response alternatives. In the case of doubt, they have the opportunity to guess. Word knowledge is not all or nothing, but rather gradual: it is possible that a participant thinks he or she knows the word, but is not sure of the exact meaning of the word. This means that in general a participant will not guess completely at random (see Anderson and Freebody, 1983). So, we prefer to use the term 'sophisticated guessing', rather than guessing at random. Sophisticated guessing (considering the probability of the response alternatives) is something that, according to Signal Detection Theory, everybody will use when confronted with a test in which knowledge is more gradual than all or nothing. The second mediating variable is the participant's individual 'response style' (Nunnally and Bernstein, 1994). When in doubt, certain participants may tend to say 'yes', whereas other participants will say 'no' unless they are absolutely certain about the meaning of the word. The response style has consequences for the participant's response behaviour on both pseudowords and real words. Someone with a conservative response style would not readily say 'yes' to either a pseudoword or a real word. Response style is an individual trait. This means that the calculation of a score on a yes-no test has to meet various conditions.
The test score should take into account that: there are two types of correct and two types of incorrect answers; participants have the possibility of (sophisticated) guessing; and participants show different response styles.

Four possible methods of calculating the score are discussed here, each of which takes into account at least one of these variables in one way or another. These methods are 'the number of correct responses', 'correction for guessing', Δm (Meara, 1992b) and a new index derived from SDT. In assessing these methods it is important to keep in mind what instruction testees are given in the EFL vocabulary test. They are asked to indicate whether or not they know the meaning of the given words. They know that the test contains non-existing words. In this context, saying 'no' would mean that the participant

is answering in a conservative manner. You cannot make a mistake by saying 'no'. Answering 'yes', however, might yield a false alarm.

1 The number of correct responses

The most straightforward method would be to count the number of correct responses. Correct responses are the hits and the correct rejections of pseudowords (see Figure 1). However, these two types of responses are correct in different ways and for different reasons, and it does not seem appropriate to consider them equivalent. This procedure does not meet the first condition mentioned earlier. The situation could arise where participants who say 'yes' to 58 out of 60 items (for instance, 39 hits and one correct rejection) obtain exactly the same score as participants who have 20 hits and also 20 correct rejections. Counting the number of correct responses does not permit discrimination between these very different types of response behaviour. Thus, it must be concluded that counting the number of correct responses does not yield an adequate score on a yes-no test. Instead, it seems to make more sense to calculate the proportion of hits minus the false alarm rate. In this case, participants increase their test score with hits and at the same time are penalized for false alarms. However, this estimate is not accurately corrected for the participant's individual response style (the third condition). For example, a participant who scores eight hits (a hit rate of 0.20) and no false alarms obtains a score of 0.20. This kind of response behaviour represents quite a conservative response style, as the participant rarely says 'yes'. The number of words that the participant actually knows is probably higher. This is not taken into account in this method of calculating the test score. The score is adjusted only if a participant takes risks and has a false alarm rate larger than 0.
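The examples in this subsection can be reproduced in a few lines. This is a minimal sketch, assuming the 40 real words and 20 pseudowords of one EFL vocabulary test. The function and constant names are illustrative and are not part of the article.

```python
N_WORDS = 40    # real words in one test (from the test description)
N_PSEUDO = 20   # pseudowords in one test


def number_correct(hits: int, false_alarms: int) -> int:
    """Hits plus correct rejections of pseudowords."""
    return hits + (N_PSEUDO - false_alarms)


def hit_minus_false_alarm(hits: int, false_alarms: int) -> float:
    """Proportion of hits minus the false alarm rate (h - f)."""
    return hits / N_WORDS - false_alarms / N_PSEUDO


# 'Yes' to 58 of 60 items (39 hits, 19 false alarms) versus 20 hits and no false alarms.
print(number_correct(39, 19), number_correct(20, 0))   # 40 40
print(round(hit_minus_false_alarm(39, 19), 3))         # 0.025
print(round(hit_minus_false_alarm(20, 0), 3))          # 0.5

# The conservative participant: eight hits and no false alarms.
print(round(hit_minus_false_alarm(8, 0), 3))           # 0.2
```

The count of correct responses cannot separate the first two participants, whereas h - f does. The last line shows the case that h - f leaves uncorrected.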
2 Correction for guessing

Several researchers (Anderson and Freebody, 1983; Meara and Buxton, 1987) applied correction for guessing in order to calculate the score on a yes-no test for vocabulary knowledge. This method is based on the 'blind guessing' model, a model of correcting scores for guessing behaviour. The basic assumption in this model is that there are two possibilities for each item: either the participant knows the correct response, in which case the probability of a correct response is 1, or he or she guesses completely at random. In the latter case the probability of a correct response is 1/K, where K refers to the number of response alternatives (see, for example, Nunnally and Bernstein, 1994).

Green and Swets (1966) argue that according to the blind guessing model each false alarm in a yes-no test is the result of guessing behaviour. Thus, the observed false alarm rate is the proportion of 'lucky guesses'. The equation for the 'blind guessing' model in terms of probabilities of hits and false alarms can be formulated as:

$$P(h) = P^*(h) + P(f)\,[1 - P^*(h)] \qquad (1)$$

where P(h) = the observed hit rate; P*(h) = the true hit rate; P(f) = the false alarm rate. Equation (1) can be reformulated as the equation of correction for guessing in order to be able to estimate the true hit rate (see Green and Swets, 1966; Anderson and Freebody, 1983):

$$P^*(h) = \frac{P(h) - P(f)}{1 - P(f)} \qquad (2)$$
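The step from equation (1) to equation (2) is a short rearrangement. It is sketched here under the assumption that P(f) is smaller than 1, so that the final division is defined:

```latex
\begin{align*}
P(h) &= P^*(h) + P(f)\,[1 - P^*(h)] \\
P(h) - P(f) &= P^*(h)\,[1 - P(f)] \\
P^*(h) &= \frac{P(h) - P(f)}{1 - P(f)}
\end{align*}
```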

There are several issues and specific problems concerning the blind guessing model that need to be considered. The first problem is linked to the third condition that the score should meet: the estimate of the score does not take into account possible individual preferences for one response over the other (response style). The second issue is a conceptual one concerning the assumption that a 'yes' response means that either the participant really knows the word, or he or she makes a completely random guess. This assumption does not allow for various degrees of knowing, apart from perfect knowledge and absence of knowledge. So, this model does not take into account sophisticated guessing and therefore does not meet the second condition for an accurate calculation of the score. The third issue concerns some peculiarities of the equation that are not easily solved. In the event that the observed hit rate equals 1.0, the corrected hit rate will also equal 1.0, regardless of the false alarm proportion. This means that participants who show correct responses on all items ('yes' to all real words and 'no' to all pseudoword items) would obtain the same score as participants who give 'yes' responses to all real words as well as to most of the pseudowords. It is clearly unjustifiable for participants with a false alarm proportion larger than 0 to obtain a score of 1.0. In the event that the false alarm rate equals 1.0, the equation cannot be solved. On the basis of the problems with the blind guessing model in the context of a test of vocabulary knowledge, one can doubt the suitability of this model for the estimation of the score on the EFL vocabulary test. It does not meet the conditions of sophisticated guessing and the individual response style.
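The two peculiarities of the equation can be made concrete with a small sketch of equation (2). Returning `None` when the false alarm rate equals 1 is an assumption of this example, not something the article prescribes. The function name is likewise illustrative.

```python
from typing import Optional


def correction_for_guessing(h: float, f: float) -> Optional[float]:
    """Equation (2): (h - f) / (1 - f).

    Returns None when f == 1, where the equation cannot be solved
    (a choice made for this sketch only)."""
    if f == 1:
        return None
    return (h - f) / (1 - f)


# A perfect hit rate yields 1.0 whatever the false alarm rate is.
print(correction_for_guessing(1.0, 0.0))   # 1.0
print(correction_for_guessing(1.0, 0.75))  # 1.0

# 'Yes' to every pseudoword: no score can be computed.
print(correction_for_guessing(1.0, 1.0))   # None
```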

3 Signal Detection Theory: Meara's Δm

Signal Detection Theory (SDT) provides an advanced model of correction for guessing behaviour (Nunnally and Bernstein, 1994). This model allows for sophisticated guessing through the assumption that each response alternative possesses a certain degree of credibility and that participants will choose the most plausible alternative. This model in fact provides an alternative for the all-or-none assumption of the blind guessing model and therefore fulfils the second condition of calculating the test score. SDT is not generally utilized in applied linguistics and the suitability of the theory in this context has not been discussed to a great extent. The theory is applied particularly in experiments concerning word recognition (Hoshino, 1991; MacLeod and Kampe, 1996), but examples of applications concerning word knowledge can also be found (Phillips and Grodsky, 1985; Behrend, 1988). Meara (1992b) suggests using SDT for calculating the score on the EFL vocabulary test and introduces a formula to calculate such a score.

a Core features of SDT: For a vocabulary test with a yes-no format the participant's response behaviour can be described in terms of a stimulus-response matrix, as in Figure 1. It should be noted that the entries in the four cells of this matrix are conditional probabilities. Only two numbers are independent, one in each row. This means that the matrix has only two degrees of freedom. Therefore, all information can be represented by one point in a two-dimensional graph, with the co-ordinates labelled P(Y|w) (the hit rate) and P(Y|p) (the false alarm rate). The hit rate, the false alarm rate, and their interrelationship do not only depend on the participant's word knowledge and individual response style, but also on their knowledge about the consequences of a certain response. A change of the test instructions (for example, an additional reward for 'yes' responses) can influence response behaviour.
Modification of decision behaviour results in a change in the ratio between the hit rate and the false alarm rate and, thus, in a new data point in the graph. By applying a variety of instructions a series of data points can be generated, one for each possible set of instructions. The points can be connected by a curve, which has been called the Receiver Operating Characteristic (ROC) curve. The two ends of the ROC curve represent the two most extreme types of response behaviour: (0,0) results from a 100% 'no' response, while (1,1) is the result of 'yes' responses to every item. Green and Moses (1966: 229-30) and Green and Swets (1966: 45-50) show that the area under the ROC curve of a yes-no task equals the percentage of correct responses in a corresponding two-alternative

forced-choice task. The latter includes two stimuli from which participants have to choose the 'correct' one. Green and Swets have formulated this relationship on a theoretical basis. Empirical evidence comes from a recognition task conducted by Green and Moses (1966) and a detection task involving aural signals (Emmerich, 1968, described in Pollack and Hsieh, 1969). Green and Swets (1966: 50) reach the conclusion that the area under the ROC curve 'provides a convenient and simple index of the detectability of the signal' and therefore represents the participants' abilities to distinguish signals. Green and Swets (1966) and Pastore and Scheirer (1974) show that this measure is nonparametric. In the context of the EFL vocabulary test, this is useful because there is no complete information available about the frequencies of hits and false alarms. This nonparametric index is based on the geometry of the 'unit square' (a square of which the area represents one unit). Within this square the hit rate (y axis) can be plotted as a function of the false alarm rate (x axis). If the test is administered only once (and thus with only one set of instructions), there is just one data point available. An example is provided in Figure 2. A graph may be constructed by drawing a line connecting the data point P (x,y) with point A (0,0) and point C (1,1) respectively. If the hit rate exceeds the false alarm rate, the data point P is located within the triangle ABC. Guessing

Figure 2 A ROC curve in the unit square

Figure 3 Data point plotted in the unit square

completely at random, which in theory involves equal probabilities of hits and false alarms, produces points that fall on the positive diagonal AC. Point B represents a hit rate of 1 and a false alarm rate of 0. Point D represents the opposite: a hit rate of 0 and a false alarm rate of 1. The index is based on the area of the quadrangle APCD. Pollack and Norman (1964) argue that this index does not simply equal the area under the graph, because there is just one data point available. Therefore, the precise shape of the graph is unknown. They suggest an area A' (called Ag by Pastore and Scheirer (1974) and in early publications by Meara), being the average of the largest and the smallest area that can be associated with a certain data point. Grier (1971) describes a method for calculating this average. Figure 3 shows two reference lines, one through the data point P and (0,0) and another through P and (1,1). These two lines produce two triangles, A1 and A2. These two triangles determine the range of all possible ROC curves through the data point involved. The four line segments define the upper and lower limits of the area to be calculated (see Grier, 1971: 424-25). The equation suggested by Grier is then:

$$A' = I + \tfrac{1}{2}(A_1 + A_2) \qquad (3)$$


where I is the area under the solid lines, representing the minimal area. Using the co-ordinates from the unit square (see Appendix 1 for a description of the derivation) the equation can be rewritten as:

$$A' = \frac{1}{2} + \frac{(y - x)(1 + y - x)}{4y(1 - x)} \qquad (4)$$

In terms of proportions of hits and proportions of false alarms the equation can be rewritten as:

$$A' = \frac{1}{2} + \frac{(h - f)(1 + h - f)}{4h(1 - f)} \qquad (5)$$

This equation makes sense only when the hit proportion (h) is equal to or exceeds the false alarm rate (f). If not, the performance is below chance level and a score is not calculated. When both proportions are the same, the areas A1 and A2 equal 0. In that case, the data point lies on the positive diagonal of the square (the diagonal AC). The corresponding value of A' (0.5) is the minimal value. Thus, the area under the ROC curve ranges from 0.5 (in which case the hit rate equals the false alarm rate) to 1.0 (where the hit rate is 1 and the false alarm rate is 0).

b Meara's Δm: The formula used by Meara (1992b) estimates the hit rate that the participants would have scored if they had not said 'yes' to any of the pseudowords. The measure Δm is a transformation of A' (4A' - 3):

$$\Delta m = \frac{(h - f)(1 + h - f)}{h(1 - f)} - 1 \qquad (6)$$
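Equations (5) and (6) can be sketched together to show how the two measures relate. The handling of the boundary cases is an assumption of this example: below-chance performance returns `None` because the article states that no score is calculated, and equal rates return 0.5 directly because the article gives that value for points on the positive diagonal. The function names are illustrative.

```python
from typing import Optional


def a_prime(h: float, f: float) -> Optional[float]:
    """Area measure of equation (5) for hit rate h and false alarm rate f."""
    if h < f:
        return None          # below chance: no score is calculated
    if h == f:
        return 0.5           # data point on the positive diagonal
    # Here h > f, so h > 0 and f < 1 and the denominator is non-zero.
    return 0.5 + (h - f) * (1 + h - f) / (4 * h * (1 - f))


def delta_m_from_area(area: float) -> float:
    """Transformation 4A' - 3 that gives equation (6)."""
    return 4 * area - 3


print(a_prime(1.0, 0.0))                      # 1.0
print(a_prime(0.5, 0.5))                      # 0.5
print(delta_m_from_area(a_prime(0.5, 0.5)))   # -1.0
```

Because the transformation maps an area of 0.75 onto a score of 0, every performance with an area between 0.5 and 0.75 receives a negative value.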

One of the problems of this method is that the test score is not corrected for the participant's response style. What is in fact determined is the number of hits that would have been scored by participants who only say 'yes' if they are absolutely sure of knowing the meaning of the word and therefore do not have any false alarms. In the case of f = 0, the score Δm equals the value of h. This means that the test score equals a hit rate of a participant with an extremely conservative response style or, in other words, the hit rate of a participant who only says 'yes' if there is no doubt about the meaning of the word. Therefore, the corrected hit rate in this model will probably be an underestimation of the true hit rate. This is obvious when we rewrite Meara's formula as:

$$\Delta m = \frac{h - f}{1 - f} - \frac{f}{h} \qquad (7)$$

The first term of the Meara formula equals the formula for correction for guessing (equation 2). This means that the score Δm equals the 'correction for guessing' minus the ratio f/h. The formula has an effect similar to that of correction for guessing, but the false alarm rate lowers the test score twice. This explains the observed large differences in scores with relatively small differences in false alarm rates. Another result of the Δm formula is that it only results in valid scores if the area under the ROC curve equals at least 0.75. Although the index Δm has a range from 0 to 1, it includes only part of the response behaviour above chance level. We conclude that the underlying model of SDT meets the conditions of calculating a test score, but that this is not the case for the formula of Δm: its result resembles the effect of the blind guessing model and it does not correct for individual response style in an adequate way. In most cases, such a correction of the score will result in an underestimation of the true score. Below, we propose a new index that corrects the test score for the participant's response style.

4 Signal Detection Theory: a new index

The measure A' is an index of the performance on the test. This index takes into account both the ratio of hits and false alarms and the effect of sophisticated guessing. The model assumes that each alternative has a certain credibility and that a participant considers the alternatives when he or she is in doubt. The measure is not yet corrected for the participant's response style. Hodos (1970) describes a procedure for the evaluation of response style, based on the unit square in which the probability of a hit is plotted as a function of the probability of a false alarm. According to Hodos, a point falling on the y axis at the left of the square represents a maximal inclination to say 'no'. In this case, the probability of a false alarm is 0, as in the case of doubt the response will be 'no'.
A point falling on the x axis at the top of the square represents a maximal tendency to say 'yes'. When in doubt, the response is 'yes', resulting in a probability of a hit equal to 1 (the participant either knows the meaning of the word or hesitates and says 'yes') and a false alarm rate equal to the degree in which the participant hesitates about the pseudowords. Between the two axes the negative diagonal of the unit square (BD in Figure 2) represents 'unbiased' performance. Hodos states that a point that lies on this diagonal represents a ratio of hits and false alarms associated with an unbiased response style.

This means that, in order to estimate the test score corrected for response style on the basis of a data point, it is necessary to calculate the intersection of the ROC curve of the participant with the negative diagonal. Grier (1971: 425) gives the co-ordinates of this point:

$$(x, y) = \left( \frac{2(1 - A')}{3 - 2A'},\; 1 - \frac{2(1 - A')}{3 - 2A'} \right) \qquad (8)$$

Inserting the equation for A' (5) into equation (8) results in a formula that determines the y co-ordinate:

$$y = 1 - \frac{2h(1 - f) - (h - f)(1 + h - f)}{4h(1 - f) - (h - f)(1 + h - f)} \qquad (9)$$

This co-ordinate is in fact the true hit rate, the hit rate that would have been scored if the participant's response style had been perfectly unbiased. Thus, the value of the y co-ordinate can serve as the score on the EFL vocabulary test. The lowest possible value of this co-ordinate is 0.5 because the lower half of the negative diagonal lies in the area of the unit square that represents scores below chance. The new index for calculating the score on the EFL vocabulary test is a linear transformation (2y - 1) of equation (9), resulting in a score with a range from 0 to 1. This formula is:

$$I_{SDT} = 1 - \frac{4h(1 - f) - 2(h - f)(1 + h - f)}{4h(1 - f) - (h - f)(1 + h - f)} \qquad (10)$$
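Equation (10) is straightforward to compute. The sketch below is one possible rendering and rests on two assumptions of its own: below-chance performance returns `None`, in line with the treatment of A', and equal rates return 0 directly, which is the value equation (10) gives whenever h equals f and the denominator is non-zero. The function name `i_sdt` is illustrative.

```python
from typing import Optional


def i_sdt(h: float, f: float) -> Optional[float]:
    """Index of equation (10) for hit rate h and false alarm rate f."""
    if h < f:
        return None      # below chance: no score is calculated
    if h == f:
        return 0.0       # chance level
    d = 4 * h * (1 - f)              # 4h(1 - f)
    n = (h - f) * (1 + h - f)        # (h - f)(1 + h - f)
    return 1 - (d - 2 * n) / (d - n)


print(i_sdt(1.0, 0.0))             # 1.0
print(i_sdt(0.5, 0.5))             # 0.0
print(round(i_sdt(0.8, 0.2), 2))   # 0.6
print(round(i_sdt(0.2, 0.0), 2))   # 0.43
```

The last line is the conservative participant from the earlier example with eight hits and no false alarms: the index gives about 0.43, against 0.20 for the hit rate minus the false alarm rate.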

III A comparison based on examples

It is interesting to see if the new score ISDT yields different values from the other three indices, namely, the hit rate minus the false alarm rate (h - f), correction for guessing (cfg), and Meara's Δm. We therefore compared several combinations of hit and false alarm proportions. Appendix 2 contains a table with scores based on the different measures for various hit rates (intervals of 0.10) and false alarm rates (with intervals of 0.05). The different indices yield similar values for a performance at chance level (when both rates are equal). For all measures that in this case yield valid scores the test score value is 0. The score based on Δm would have a value below 0 and is therefore discarded. For a very good performance (a maximum hit rate in combination with no false alarms or a low false alarm rate) the values of the different indices are similar as well. A participant has a score of 1.0 or nearly 1.0. The indices Δm, h - f and ISDT reach the value of 1.0 only if the participant answered all items correctly (h = 1 and f = 0).
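A comparison of this kind can be reproduced with a short script. The sketch below computes the four indices from equations (2), (6) and (10) for a handful of rate combinations. The chosen pairs are illustrative and are not taken from Appendix 2, and the function assumes h > f so that every denominator is non-zero.

```python
def indices(h: float, f: float) -> dict:
    """Four scores for hit rate h and false alarm rate f (assumes h > f)."""
    n = (h - f) * (1 + h - f)   # (h - f)(1 + h - f)
    d = h * (1 - f)             # h(1 - f)
    return {
        "h-f": h - f,
        "cfg": (h - f) / (1 - f),                    # equation (2)
        "delta_m": n / d - 1,                        # equation (6)
        "i_sdt": 1 - (4 * d - 2 * n) / (4 * d - n),  # equation (10)
    }


# Illustrative combinations: perfect, good, liberal and conservative response behaviour.
for h, f in [(1.0, 0.0), (0.8, 0.2), (0.9, 0.7), (0.2, 0.0)]:
    scores = indices(h, f)
    print(h, f, {name: round(value, 2) for name, value in scores.items()})
```

Negative values of `delta_m` in the output correspond to the cases the article treats as invalid scores.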

A test performance with a hit rate of 0.80 and a false alarm rate of 0.20 results in different scores depending on the measure being used. In this specific case, the index h - f results in the same score as the ISDT. In other, but similar, cases h - f results in a score almost equal to the value of the index ISDT. The score derived from Δm is always lower than the values of the other indices. This can be explained by the fact that Δm yields the test score of a participant with an extremely conservative response style. Thus, Δm produces a value that always underestimates the true score, whereas ISDT corrects for a conservative response style when the false alarm proportion is low. The value produced by cfg is higher than the value of ISDT because correction for guessing stresses the hit rate. This is particularly manifest when the hit rate equals 1: regardless of the number of times the participant says 'yes' to a pseudoword, the score will be 1 as well. In the case of a high hit rate, cfg yields a comparatively high score, especially when the high hit rate coincides with a fairly low false alarm rate. When a participant says 'yes' on both real words and pseudowords and both proportions are large, both h - f and Δm result in lower values than ISDT. Proportions of similar sizes along with an average hit rate result in a value below 0 when Δm is applied. The value of this index drops drastically with decreasing performance, in particular in the case of small hit proportions, and reaches 0 even when performance is still well above chance level. On the other hand, the value of cfg is rather high, as it does not correct for the participant's liberal response style. The value of h - f is exactly or nearly the same as the value of the index ISDT, although it does not include a correction for response style.
Participants who rarely say 'yes', and thus show a low hit rate and a low false alarm rate, obtain a higher score according to the index ISDT than they would if one of the other indices were applied. This is not surprising since ISDT corrects the score for a conservative response style. The values of both cfg and h - f are low, the former because of the small hit proportion, the latter as a result of the small difference between the two proportions. As in the other cases of poor performance, Δm yields an invalid test score.

IV Discussion and conclusions

A yes-no test offers an instrument for measurement of the size of passive vocabulary knowledge of foreign language learners. On the basis of the proportions of hits ('yes' responses to real words) and false alarms ('yes' responses to pseudowords) the test score can be estimated. The calculation of a test score should take into account the fact that there are two types of correct and incorrect answers; there

is likely to be sophisticated guessing, and participants show different response styles. In this article, several methods of estimating a test score have been discussed, in particular the score on the English as a Foreign Language Vocabulary Test (EFL vocabulary test) that was developed by Meara (1992a). The measures discussed include h - f (the hit proportion minus the false alarm rate), 'correction for guessing' and Δm, a measure proposed by Meara (1992b). None of these measures adequately takes into account the conditions for calculating a score on a yes-no test. We propose a new index that is based on SDT. This index, the ISDT, involves the proportions of hits and false alarms, and a correction for sophisticated guessing and the participant's response style. Comparison of the measures on the basis of the examples reveals that the procedures of correction for guessing and Δm often result in values different from the values of ISDT. More specifically, the measure Δm always yields an underestimation of the intended standard, whereas correction for guessing gives an overestimation for large hit proportions. Despite the fact that h - f does not contain correction for response style, this measure produces values that in most cases are comparable and sometimes even identical to the values of ISDT. Exceptions are cases in which both the hit rate and the false alarm rate are either very high or very low. In these cases the value of h - f is lower than the value of ISDT, as it does not correct for the manifested extreme response style. An advantage of h - f is that it is easy to calculate the index and to explain the procedure. In practice, the exceptions for which h - f and the new index give different results are not very common. For this reason, it is possible to argue on empirical grounds that h - f is a practical measure in, for example, the context of small-scale applications of the yes-no test in education.
An important disadvantage of h − f is that the score is not corrected for the participants' response style. The new index yields a test score that is corrected for guessing and response style and is accurate for every possible type of response behaviour. We would therefore propose using the new measure in research applications. However, some problems remain in applying SDT to the calculation of a score on a yes-no test; they concern the assumptions of SDT and their validity for the word knowledge test. One of the basic assumptions of SDT is a fixed stimulus condition, meaning a series of similar stimuli from which the participant should distinguish between two classes, one containing a signal and one containing just noise. However, it is not self-evident that the items of a yes-no vocabulary test can be considered as similar stimuli and that there is no need to consider possible differences in difficulty between
the items. One of the assumptions underlying the EFL vocabulary test is that word frequency determines the difficulty of the item. Given that all real words in an individual test belong to one and the same frequency range, it can be argued that these items can be considered similar. This is less obvious for the pseudowords, but these items are constructed in such a way that they show characteristics of the real words of the frequency range involved (for instance, concerning the number of syllables). A second, possibly problematic, assumption of SDT is the translation of the participants' response style into types of response behaviour. As described, Hodos (1970) argues that a maximal inclination to say 'no' in the case of doubt entails a probability of a false alarm equal to 0. This is based on the supposition that a participant confronted with a pseudoword has two alternatives: either hesitating or saying 'no' with certainty. Saying 'yes' with certainty to a pseudoword is not considered to be an option. Whether this is correct is open to discussion. However, it is highly unlikely that a participant will be certain about knowing a pseudoword. Something similar holds for a maximal inclination to say 'yes'. In this case, according to Hodos, the probability of a hit is 1, because in the case of a real word, too, a participant is supposed to have a choice between two alternatives (say 'yes' with certainty or hesitate). If a testee has a strong tendency to say 'yes', hesitation will eventually result in a 'yes' response. However, it is not inconceivable that a participant might say 'no' to a real word without hesitation; that is to say, participants might know for sure that they do not know the meaning of the word or that they do not recognize the word. Nevertheless, we expect that a test setting, in which participants have an achievement-oriented attitude, will cause them to show at least some tendency to hesitate if they do not know a word.
A third assumption of SDT that could raise questions is the assertion that scores below chance level can only be generated by measurement errors or by the participant (deliberately) making mistakes in completing the test form. This means that no test score is computed if the false alarm rate exceeds the hit rate. There might be (rare) occasions when we would want to accept scores below chance level as valid scores; for example, when a very low level of word knowledge coincides with a strong tendency to say 'yes' in the case of doubt. In this case the false alarm rate may be slightly higher than the hit rate. To consider a score below chance level as valid would mean that information about such uncommon cases could be preserved. However, in practice it would be difficult to distinguish these cases from cases of malingering. It therefore seems justifiable to discard scores below chance performance in order to avoid
allowing these participants valid marks. As a result of the assumption that a score is invalid when the hit rate is lower than the false alarm rate, the index ISDT never falls below 0.5 (or 0 in the case of the linear transformation applied in this article). The question of what would be an appropriate interpretation of the test score remains. Meara (1992b) argues that the score Δm represents the proportion of the words from the frequency range involved that are known by the participant. In the case of ISDT, the score refers to a hit rate that is corrected for sophisticated guessing and response style. The index ISDT is an accurate measure and is suitable for comparing groups of participants and for expressing the development of the vocabulary size of language learners. Further research should be conducted in order to determine what the test score really means in practice; for example, for admission to and completion of courses or training. One possibility would be to conduct a study on the relationship between the score on the yes-no test and other, already standardized measures of language proficiency.

V References
Anderson, R.C. and Freebody, P. 1983: Reading comprehension and the assessment and acquisition of word knowledge. In Hutson, B.A., editor, Advances in reading/language research. Greenwich, CT: JAI Press, 132–255.
Behrend, D.A. 1988: Overextensions in early language comprehension: evidence from a signal detection approach. Journal of Child Language 15, 63–75.
Chapelle, C. 1998: Construct definition and validity inquiry in SLA research. In Bachman, L.F. and Cohen, A.D., editors, Interfaces between second language acquisition and language testing research. Cambridge: Cambridge University Press, 32–70.
Emmerich, D.S. 1968: Receiver-operating characteristics determined under several interaural conditions of listening. Journal of the Acoustical Society of America 43, 298–307.
Green, D.M. and Moses, F.L. 1966: On the equivalence of two recognition measures of short-term memory. Psychological Bulletin 66, 228–34.
Green, D.M. and Swets, J.A. 1966: Signal detection theory and psychophysics. New York: John Wiley and Sons.
Grier, J.B. 1971: Nonparametric indexes for sensitivity and bias: computing formulas. Psychological Bulletin 75, 424–29.
Hindmarsh, R. 1982: Cambridge English lexicon. Cambridge: Cambridge University Press.
Hodos, W. 1970: Nonparametric index of response bias for use in detection and recognition experiments. Psychological Bulletin 74, 351–54.
Hoshino, Y. 1991: A bias in favor of the positive response to high-frequency words in recognition memory. Memory and Cognition 19, 607–16.
MacLeod, C.M. and Kampe, K.E. 1996: Word frequency effects on recall, recognition, and word fragment completion tests. Journal of Experimental Psychology: Learning, Memory, and Cognition 22, 132–42.
Meara, P. 1992a: EFL vocabulary tests. Swansea: Centre for Applied Language Studies, University College Swansea.
Meara, P. 1992b: New approaches to testing vocabulary knowledge. Draft paper. Swansea: Centre for Applied Language Studies, University College Swansea.
Meara, P. and Buxton, B. 1987: An alternative to multiple choice vocabulary tests. Language Testing 4, 142–54.
Nation, I.S.P., editor, 1986: Vocabulary lists (Revised edn). Wellington: Victoria University English Language Institute.
Nunnally, J.C. and Bernstein, I.H. 1994: Psychometric theory. 3rd edn. New York: McGraw-Hill.
Pastore, R.E. and Scheirer, C.J. 1974: Signal detection theory: considerations for general application. Psychological Bulletin 81, 945–58.
Phillips, G.W. and Grodsky, M. 1985, March-April: Testing Piaget's theory of probability concept development: a Bayesian approach using the theory of signal detection. Paper presented at the 69th Annual Meeting of the American Education Research Association, Chicago.
Pollack, I. and Hsieh, R. 1969: Sampling variability of the area under the ROC-curve and of d′e. Psychological Bulletin 71, 161–73.
Pollack, I. and Norman, D.A. 1964: A nonparametric analysis of recognition experiments. Psychonomic Science 1, 125–26.
Read, J. 1997: Assessing vocabulary in a second language. In Clapham, C. and Corson, D., editors, Language testing and assessment. Dordrecht: Kluwer.

Appendix 1

The derivation of the expression for computing A′

The derivation of the expression and the labels used in the derivation refer to Figure 3 above.

$$A' = I + \tfrac{1}{2}(A_1 + A_2) = PKDE + PCK + PEA + \tfrac{1}{2}(A_1 + A_2)$$

$$PKDE + PCK + PEA = y(1 - x) + \frac{(1 - x)(1 - y)}{2} + \frac{xy}{2}$$

$$A_1 = \frac{(1 - x)(1 - y)}{2} - PJH$$

$$A_2 = \frac{xy}{2} - PFG$$

The triangles PJH and PEA are identical in shape (i.e., the ratios between the sides of each of the two triangles are the same), as are the triangles PFG and PCK. This means that the long side (PH) of triangle PJH equals $(1 - y)$, while the short side (JH) of PJH equals $\frac{1 - y}{y} \cdot x$. The long side (GP) of triangle PFG equals $x$ and the short side (FG) of this triangle equals $\frac{x}{1 - x} \cdot (1 - y)$. The area of PJH is then:

$$\frac{1}{2} \cdot (1 - y) \cdot \frac{x(1 - y)}{y} = \frac{x(1 - y)(1 - y)}{2y}$$

and the area of PFG is:

$$\frac{1}{2} \cdot x \cdot \frac{x(1 - y)}{1 - x} = \frac{x \cdot x(1 - y)}{2(1 - x)}$$

Substitution of all individual elements produces the expression for A′:

$$A' = \frac{1}{2} + \frac{(y - x)(1 + y - x)}{4y(1 - x)}$$

Appendix 2

Values of the different measures for various hit rates and false-alarm rates

| h | f | h − f | cfg | Δm | new |
|------|------|------|------|------|------|
| 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1.00 | 0.05 | 0.95 | 1.00 | 0.95 | 0.95 |
| 1.00 | 0.10 | 0.90 | 1.00 | 0.90 | 0.90 |
| 1.00 | 0.15 | 0.85 | 1.00 | 0.85 | 0.86 |
| 1.00 | 0.20 | 0.80 | 1.00 | 0.80 | 0.82 |
| 1.00 | 0.25 | 0.75 | 1.00 | 0.75 | 0.78 |
| 1.00 | 0.30 | 0.70 | 1.00 | 0.70 | 0.74 |
| 1.00 | 0.35 | 0.65 | 1.00 | 0.65 | 0.70 |
| 1.00 | 0.40 | 0.60 | 1.00 | 0.60 | 0.67 |
| 1.00 | 0.45 | 0.55 | 1.00 | 0.55 | 0.63 |
| 1.00 | 0.50 | 0.50 | 1.00 | 0.50 | 0.60 |
| 1.00 | 0.55 | 0.45 | 1.00 | 0.45 | 0.57 |
| 1.00 | 0.60 | 0.40 | 1.00 | 0.40 | 0.54 |
| 1.00 | 0.65 | 0.35 | 1.00 | 0.35 | 0.51 |
| 1.00 | 0.70 | 0.30 | 1.00 | 0.30 | 0.48 |
| 1.00 | 0.75 | 0.25 | 1.00 | 0.25 | 0.45 |
| 1.00 | 0.80 | 0.20 | 1.00 | 0.20 | 0.43 |
| 1.00 | 0.85 | 0.15 | 1.00 | 0.15 | 0.40 |
| 1.00 | 0.90 | 0.10 | 1.00 | 0.10 | 0.38 |
| 1.00 | 0.95 | 0.05 | 1.00 | 0.05 | 0.36 |
| 1.00 | 1.00 | 0.00 |      |      |      |
| 0.90 | 0.00 | 0.90 | 0.90 | 0.90 | 0.90 |
| 0.90 | 0.05 | 0.85 | 0.89 | 0.84 | 0.85 |
| 0.90 | 0.10 | 0.80 | 0.89 | 0.78 | 0.80 |
| 0.90 | 0.15 | 0.75 | 0.88 | 0.72 | 0.75 |
| 0.90 | 0.20 | 0.70 | 0.88 | 0.65 | 0.70 |
| 0.90 | 0.25 | 0.65 | 0.87 | 0.59 | 0.66 |
| 0.90 | 0.30 | 0.60 | 0.86 | 0.52 | 0.62 |
| 0.90 | 0.35 | 0.55 | 0.85 | 0.46 | 0.57 |
| 0.90 | 0.40 | 0.50 | 0.83 | 0.39 | 0.53 |
| 0.90 | 0.45 | 0.45 | 0.82 | 0.32 | 0.49 |
| 0.90 | 0.50 | 0.40 | 0.80 | 0.24 | 0.45 |
| 0.90 | 0.55 | 0.35 | 0.78 | 0.17 | 0.41 |
| 0.90 | 0.60 | 0.30 | 0.75 | 0.08 | 0.37 |
| 0.90 | 0.65 | 0.25 | 0.71 |      | 0.33 |
| 0.90 | 0.70 | 0.20 | 0.67 |      | 0.29 |
| 0.90 | 0.75 | 0.15 | 0.60 |      | 0.24 |
| 0.90 | 0.80 | 0.10 | 0.50 |      | 0.18 |
| 0.90 | 0.85 | 0.05 | 0.33 |      | 0.11 |
| 0.90 | 0.90 | 0.00 | 0.00 |      | 0.00 |
| 0.80 | 0.00 | 0.80 | 0.80 | 0.80 | 0.82 |
| 0.80 | 0.05 | 0.75 | 0.79 | 0.73 | 0.76 |
| 0.80 | 0.10 | 0.70 | 0.78 | 0.65 | 0.70 |
| 0.80 | 0.15 | 0.65 | 0.76 | 0.58 | 0.65 |
| 0.80 | 0.20 | 0.60 | 0.75 | 0.50 | 0.60 |
| 0.80 | 0.25 | 0.55 | 0.73 | 0.42 | 0.55 |
| 0.80 | 0.30 | 0.50 | 0.71 | 0.34 | 0.50 |
| 0.80 | 0.35 | 0.45 | 0.69 | 0.25 | 0.46 |
| 0.80 | 0.40 | 0.40 | 0.67 | 0.17 | 0.41 |
| 0.80 | 0.45 | 0.35 | 0.64 | 0.07 | 0.37 |
| 0.80 | 0.50 | 0.30 | 0.60 |      | 0.32 |
| 0.80 | 0.55 | 0.25 | 0.56 |      | 0.28 |
| 0.80 | 0.60 | 0.20 | 0.50 |      | 0.23 |
| 0.80 | 0.65 | 0.15 | 0.43 |      | 0.18 |
| 0.80 | 0.70 | 0.10 | 0.33 |      | 0.13 |
| 0.80 | 0.75 | 0.05 | 0.20 |      | 0.07 |
| 0.80 | 0.80 | 0.00 | 0.00 |      | 0.00 |
| 0.70 | 0.00 | 0.70 | 0.70 | 0.70 | 0.74 |
| 0.70 | 0.05 | 0.65 | 0.68 | 0.61 | 0.68 |
| 0.70 | 0.10 | 0.60 | 0.67 | 0.52 | 0.62 |
| 0.70 | 0.15 | 0.55 | 0.65 | 0.43 | 0.56 |
| 0.70 | 0.20 | 0.50 | 0.63 | 0.34 | 0.50 |
| 0.70 | 0.25 | 0.45 | 0.60 | 0.24 | 0.45 |
| 0.70 | 0.30 | 0.40 | 0.57 | 0.14 | 0.40 |
| 0.70 | 0.35 | 0.35 | 0.54 | 0.04 | 0.35 |
| 0.70 | 0.40 | 0.30 | 0.50 |      | 0.30 |
| 0.70 | 0.45 | 0.25 | 0.45 |      | 0.25 |
| 0.70 | 0.50 | 0.20 | 0.40 |      | 0.21 |
| 0.70 | 0.55 | 0.15 | 0.33 |      | 0.16 |
| 0.70 | 0.60 | 0.10 | 0.25 |      | 0.11 |
| 0.70 | 0.65 | 0.05 | 0.14 |      | 0.06 |
| 0.70 | 0.70 | 0.00 | 0.00 |      | 0.00 |
| 0.60 | 0.00 | 0.60 | 0.60 | 0.60 | 0.67 |
| 0.60 | 0.05 | 0.55 | 0.58 | 0.50 | 0.60 |
| 0.60 | 0.10 | 0.50 | 0.56 | 0.39 | 0.53 |
| 0.60 | 0.15 | 0.45 | 0.53 | 0.28 | 0.47 |
| 0.60 | 0.20 | 0.40 | 0.50 | 0.17 | 0.41 |
| 0.60 | 0.25 | 0.35 | 0.47 | 0.05 | 0.36 |
| 0.60 | 0.30 | 0.30 | 0.43 |      | 0.30 |
| 0.60 | 0.35 | 0.25 | 0.38 |      | 0.25 |
| 0.60 | 0.40 | 0.20 | 0.33 |      | 0.20 |
| 0.60 | 0.45 | 0.15 | 0.27 |      | 0.15 |
| 0.60 | 0.50 | 0.10 | 0.20 |      | 0.10 |
| 0.60 | 0.55 | 0.05 | 0.11 |      | 0.05 |
| 0.60 | 0.60 | 0.00 | 0.00 |      | 0.00 |
| 0.50 | 0.00 | 0.50 | 0.50 | 0.50 | 0.60 |
| 0.50 | 0.05 | 0.45 | 0.47 | 0.37 | 0.52 |
| 0.50 | 0.10 | 0.40 | 0.44 | 0.24 | 0.45 |
| 0.50 | 0.15 | 0.35 | 0.41 | 0.11 | 0.38 |
| 0.50 | 0.20 | 0.30 | 0.38 |      | 0.32 |
| 0.50 | 0.25 | 0.25 | 0.33 |      | 0.26 |
| 0.50 | 0.30 | 0.20 | 0.29 |      | 0.21 |
| 0.50 | 0.35 | 0.15 | 0.23 |      | 0.15 |
| 0.50 | 0.40 | 0.10 | 0.17 |      | 0.10 |
| 0.50 | 0.45 | 0.05 | 0.09 |      | 0.05 |
| 0.50 | 0.50 | 0.00 | 0.00 |      | 0.00 |
| 0.40 | 0.00 | 0.40 | 0.40 | 0.40 | 0.54 |
| 0.40 | 0.05 | 0.35 | 0.37 | 0.24 | 0.45 |
| 0.40 | 0.10 | 0.30 | 0.33 | 0.08 | 0.37 |
| 0.40 | 0.15 | 0.25 | 0.29 |      | 0.30 |
| 0.40 | 0.20 | 0.20 | 0.25 |      | 0.23 |
| 0.40 | 0.25 | 0.15 | 0.20 |      | 0.17 |
| 0.40 | 0.30 | 0.10 | 0.14 |      | 0.11 |
| 0.40 | 0.35 | 0.05 | 0.08 |      | 0.05 |
| 0.40 | 0.40 | 0.00 | 0.00 |      | 0.00 |
| 0.30 | 0.00 | 0.30 | 0.30 | 0.30 | 0.48 |
| 0.30 | 0.05 | 0.25 | 0.26 | 0.10 | 0.38 |
| 0.30 | 0.10 | 0.20 | 0.22 |      | 0.29 |
| 0.30 | 0.15 | 0.15 | 0.18 |      | 0.20 |
| 0.30 | 0.20 | 0.10 | 0.13 |      | 0.13 |
| 0.30 | 0.25 | 0.05 | 0.07 |      | 0.06 |
| 0.30 | 0.30 | 0.00 | 0.00 |      | 0.00 |
| 0.20 | 0.00 | 0.20 | 0.20 | 0.20 | 0.43 |
| 0.20 | 0.05 | 0.15 | 0.16 |      | 0.29 |
| 0.20 | 0.10 | 0.10 | 0.11 |      | 0.18 |
| 0.20 | 0.15 | 0.05 | 0.06 |      | 0.08 |
| 0.20 | 0.20 | 0.00 | 0.00 |      | 0.00 |
| 0.10 | 0.00 | 0.10 | 0.10 | 0.10 | 0.38 |
| 0.10 | 0.05 | 0.05 | 0.05 |      | 0.16 |
| 0.10 | 0.10 | 0.00 | 0.00 |      | 0.00 |
| 0.00 | 0.00 | 0.00 |      |      |      |

Notes: The symbol h refers to the hit rate, f is the false-alarm rate, h − f stands for the hit rate minus the false-alarm rate, cfg refers to correction for guessing, Δm is the measure proposed by Meara, and new refers to the new index (ISDT).
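Individual rows of the table can be regenerated with a few lines of code. The following is a sketch under stated assumptions rather than the procedure actually used to produce the table: the formulas are our reading of the measures, the rule that a negative Δm is left blank is inferred from the pattern of empty cells, and the names `appendix2_row` and `appendix2_grid` are hypothetical.

```python
# Illustrative sketch: regenerate Appendix 2 rows under assumed formulas.
# None stands for a cell that is left blank.

def appendix2_row(h, f):
    """Return (h - f, cfg, delta_m, new), each rounded to two decimals."""
    hf = h - f
    cfg = hf / (1 - f) if f < 1 else None        # 0/0 when f == 1
    dm = cfg - f / h if (cfg is not None and h > 0) else None
    if dm is not None and dm < 0:
        dm = None                                # assumption: negative values are left blank
    a = 4 * h * (1 - f)
    b = hf * (1 + hf)
    new = 1 - (a - 2 * b) / (a - b) if a != b else None   # 0/0 when h == f == 0 or 1
    return tuple(None if v is None else round(v, 2) for v in (hf, cfg, dm, new))

def appendix2_grid():
    """All (h, f, row) combinations with h in steps of 0.10 and f <= h in steps of 0.05."""
    rows = []
    for hk in range(100, -1, -10):               # integer steps avoid float drift
        for fk in range(0, hk + 1, 5):
            h, f = hk / 100, fk / 100
            rows.append((h, f, appendix2_row(h, f)))
    return rows

for h, f, row in appendix2_grid()[:5]:
    print(h, f, row)
```

One difference should be noted: for h = f = 0 the assumed formula for correction for guessing gives 0, whereas the corresponding cell in the table is empty.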
