PSYCHOMETRIC AND PSYCHOLOGICAL EFFECTS OF ITEM SELECTION AND REVIEW ON COMPUTERIZED TESTING
JAVIER REVUELTA, M. CARMEN XIMÉNEZ, AND JULIO OLEA
Autónoma University of Madrid

Educational and Psychological Measurement, Vol. 63 No. 5, October 2003, 791-808. DOI: 10.1177/0013164403251282. © 2003 Sage Publications.

This research has been supported in part by DGICYT grant PB 97-0049.
Psychometric properties of computerized testing, together with the anxiety and comfort of examinees, are investigated in relation to the item selection routine and the opportunity for response review. Two different hypotheses involving examinee anxiety were used to design test properties: perceived control and perceived performance. The study involved three types of administration of a computerized English test for Spanish speakers (adaptive, easy adaptive, and fixed) and four review conditions (no review, review at end, review by blocks of 5 items, and review item-by-item). These were applied to a sample of 557 first-year psychology undergraduate students to examine main and interaction effects of test type and review on psychometric and psychological variables. Statistically significant differences in test precision were found among the different types of test. Response review improved ability estimates and increased testing time. No psychological effects on anxiety were found. Examinees in all review conditions attached more importance to the possibility of review than those who were not allowed to review. These results concur with previous findings on examinees' preference for item review and raise some issues that should be addressed in the field of tests with item review.

Keywords: item selection; item review; computerized adaptive tests; computerized fixed-item tests
There is a strong tradition of research on the optimal design of the testing scenario for improving examinees' comfort and reducing anxiety, with the purposes of obtaining a purer measure of ability through the elimination of distracting elements and of increasing the social acceptance of testing. In particular, anxiety reduction is important because it improves concentration on the task, eliminates spurious errors, and provides a more faithful reflection of ability.
An important aspect of anxiety reduction involves giving examinees the opportunity to review and change their responses. Item review has been common in conventional paper-and-pencil tests but is less usual in computerized tests. The effects of item review in paper-and-pencil tests have been studied for some 70 years. In general, these studies indicate that most examinees change their responses during review, but only on a small proportion of items, and that test scores tend to improve (see Lunz, Bergstrom, & Wright, 1992; Stone & Lunz, 1994; Vispoel, 1998, 2000; Waddell & Blankenship, 1995; Wise, 1995).

Item review is usually prohibited in computerized fixed-item tests (CFITs). A CFIT is a computerized test that administers the same items to all examinees. The absence of the possibility of review is perceived as frustrating by examinees (Vispoel, Wang, de la Torre, Bleiler, & Dings, 1992) and may thus call into question a CFIT's equivalence with the paper-and-pencil version of the test (Bunderson, Inouye, & Olsen, 1989; Sykes & Ito, 1997). Two studies have addressed the effects of item review on CFITs. Vispoel (2000) found that the number of answers changed is similar to that of paper-and-pencil tests; the percentage of participants modifying answers is lower, and among them, more than 50% increase their performance on the test. There is a noteworthy inverse linear relation between trait anxiety, on one hand, and performance and ability estimation precision, on the other, with nonsignificant effects of review (allowed vs. prohibited) and trait anxiety on estimated ability. The second study, by Olea, Revuelta, Ximénez, and Abad (2000), found somewhat different results: an increased number of correct responses and improved estimated ability after review whereas the standard error remained constant, a decrease in state anxiety after item review, statistically significant effects on testing time, and a statistically significant positive linear relation between anxiety and positive attitude toward item review.

Computerized adaptive tests (CATs) are based on the idea of tailoring the test to each individual examinee. During the testing session, examinee ability is estimated after each response. This estimate is used to select the next item to be administered from a large item pool, with a view to maximizing the precision of the test score. Typically, CATs do not incorporate item review, a fact not well received by examinees: they perceive a loss of control over the test, which increases their anxiety level and affects their performance on the test (Stocking, 1997; Wise, Freeman, Finney, Enders, & Severance, 1997). Several studies on item review in CATs have appeared over the past 10 years (Lunz & Bergstrom, 1994; Lunz et al., 1992; Olea et al., 2000; Stone & Lunz, 1994; Vispoel, 1998, 2000; Vispoel et al., 1992; Vispoel, Hendrickson, & Bleiler, 2000; Vispoel, Rocklin, Wang, & Bleiler, 1999; Wise, 1995). Results have shown that item review may have positive consequences, mainly that the majority of examinees manifest a clear preference for review and perceive
that the test is fairer. With regard to test scores, results indicate that a large number of examinees modify a small number of items and that these changes tend to be from wrong to right. One possible explanation is that review contributes to relaxation during the testing situation, which the examinee sees as stressful. A second explanation is that item review modifies the psychometric properties of items, decreasing their difficulty. The present article focuses on the first of these possibilities.

Permitting item review in tests may also have certain disadvantages. Vispoel et al. (2000) suggested that it complicates test algorithms. Testing time typically increases by between 37% and 61%. More important, some examinees may use item review to obtain a positively biased ability estimate in adaptive tests (Vispoel, 2000; Vispoel et al., 1999), and test takers may develop strategies for obtaining illegitimate scores. For example, the Wainer (1993) strategy consists of answering incorrectly in the first test session, thus obtaining a less difficult test, and then answering to the best of one's ability in the review phase. Some studies have addressed this problem and found that limiting review to certain blocks of items, instead of applying it to the entire test, avoids distortion of the psychometric properties of test scores (see Gershon & Bergstrom, 1995; Stocking, 1997).

The present research is based on two hypotheses about examinee anxiety that are relevant to the optimal design of the testing environment:
The perceived performance hypothesis proposes that perceived performance on the task is the variable that determines test anxiety (Ponsoda, Olea, Rodríguez, & Revuelta, 1999). Examinees may obtain information about performance from the number of errors. This theory predicts an indirect effect of the item selection method on anxiety that depends on the difficulty of the administered items and the number of errors.

The perceived control hypothesis has been proposed to justify self-adapted testing (Wise, 1994). This hypothesis considers the test to be a source of stress. Perceived control over the cause of stress produces a reduction in test anxiety, which may optimize performance. According to this theory, item review is seen as a form of control over the test that reduces anxiety.
These hypotheses are used to define several conditions for the application of computerized testing that are expected to produce various effects on examinees' anxiety and comfort. Three item selection strategies were used: an adaptive strategy (CAT), an easy-adaptive strategy (ECAT), and a fixed-test strategy (CFIT). The ECAT is a CAT in which items are selected for a lower ability level than the estimated one and are therefore easier than those of a CAT (Bergstrom & Lunz, 1999; Lunz & Bergstrom, 1994; Ponsoda et al., 1999). According to the perceived performance hypothesis, differences in anxiety should be found among these conditions depending on their impact on test difficulty; it was predicted that the ECAT would produce lower anxiety as compared to the CAT. The CFIT is
useful for correlating performance with anxiety because examinees with different ability levels correctly answer different numbers of items and therefore perceive different performance. In contrast, in both adaptive conditions the item selection mechanism keeps perceived performance (number of correct responses) constant across participants with different abilities.

The study also included four item review conditions: no review (NR), review at the end of the test (RE), review by blocks of five items (RB), and review item by item (RI). In the RB condition, items are applied in blocks of five items; the examinee responds to and reviews each block before proceeding with the rest of the test. In the RI condition, review occurs immediately after responding to each item. If the perceived control hypothesis is true, the different item review conditions should decrease test anxiety as compared to NR. No clear expectations can be formulated about the differences among the review conditions.

The majority of previous studies have examined the psychometric effects of the review-at-end condition in CATs. The present study incorporates several original features: (a) CATs, ECATs, and CFITs; (b) various item review strategies; and (c) a mixed design involving between-subjects factors to study the psychological effects of test type and item review, and a within-subjects factor to analyze the state anxiety of the examinee before and after test administration. Interaction effects of test type, review, and occasion (before and after review) can also be studied with this design.

Aims of the Research

The objective of the present research was to obtain empirical evidence about optimal testing conditions with regard to anxiety and comfort that preserve the psychometric properties of the ability scores. Four specific research questions were addressed:
1. What are the psychometric effects of the different test conditions? The aim is to determine the effects of review and type of test administration on estimated ability and standard error.
2. What are the effects of the different test application conditions on the anxiety of examinees?
3. What are the effects of the different conditions on examinees' attitude toward item review and comfort with the testing experience?
4. What are the effects of the different conditions on testing time?
Previous research has revealed a variety of effects of item review on the psychometric properties of the ability scores. Olea et al. (2000) found an increase in number of correct responses and an improvement in mean estimated ability, whereas standard error did not change. On the other hand, Vispoel (1998) found an increase in standard error. This difference may be explained
by the type of test used in the research, in particular with regard to the imposition of a time limit for responding to each item. The present research used the same test as Olea et al. (2000), and an increase in ability after review was therefore expected. Given that participants had neither previous experience of CATs nor any knowledge about them, the possibility of their using test-taking strategies that bias the ability estimate in the review conditions is ruled out. However, it is important to determine the increase in testing time that can be expected in real applications under the different review conditions.
Method
Participants

The sample consisted of 557 undergraduate psychology students, 100 males and 457 females, from two Spanish universities: the Autónoma University of Madrid and the University of Santiago. Ages ranged between 17 and 19 years. Before data analysis, 6 participants were eliminated, either because the standard error of estimation was higher than 2.47 or because the estimation algorithm did not converge. The final sample consisted of 551 participants (97 males and 454 females).

A primary concern of research on state anxiety is to establish realistic testing conditions: students should perceive that it is a real testing situation (not an experiment) and that their performance could have some consequences. To approach medium-stakes conditions, (a) the sample consisted of first-year students; (b) data collection took place in their first 3 weeks of lectures; (c) a professor of psychometric methods persuaded them to complete the English test, and they were told about the importance of English for making progress in the psychology field; and (d) they were informed that the results would be posted on the notice board and that the head of the Psychology Department was in charge of the English assessment test. Each student who agreed to participate registered for a certain day and time to take the test. After data collection, results were published anonymously, indicating identity number and percentile and giving a brief explanation.

Conditions

The study used a mixed design, with two between-subjects factors and one within-subjects factor. The between-subjects factors were test type and review. The levels of test type were CAT, ECAT, and CFIT, as described below; the levels of review were NR, RE, RB, and RI. Crossing the two factors resulted in 12 conditions. The within-subjects factor was occasion: anxiety of examinees was measured before and after application of the test.
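As a concrete illustration of the crossed between-subjects design, the following Python sketch (ours, with hypothetical names; not part of the original study materials) enumerates the 12 cells and randomly assigns examinees to them:

```python
import itertools
import random

# Between-subjects factors of the study design
TEST_TYPES = ["CAT", "ECAT", "CFIT"]
REVIEW_CONDITIONS = ["NR", "RE", "RB", "RI"]

# Crossing the two factors yields the 12 experimental cells
CELLS = list(itertools.product(TEST_TYPES, REVIEW_CONDITIONS))

def assign_to_cells(examinee_ids):
    """Randomly assign examinees to cells, keeping cell sizes nearly equal."""
    slots = CELLS * (len(examinee_ids) // len(CELLS) + 1)
    random.shuffle(slots)
    return dict(zip(examinee_ids, slots))

assignment = assign_to_cells([f"S{i:03d}" for i in range(557)])
```

With 557 examinees this yields cells of 44 to 47 participants, consistent with the group sizes reported in Table 2 below.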
Measures

Computerized vocabulary tests. Three forms of an English vocabulary test for Spanish speakers were applied: CAT, ECAT, and CFIT. The tests included three example items to acquaint the examinee with the computer, followed by the real test, which comprised 20 items. Items consisted of an English word and five possible translations into Spanish, only one of which was correct. Examinees chose an alternative with the cursor arrow keys and pressed backspace when they were sure of their response. They had 15 seconds to answer each item; a clock in the top right of the screen indicated the time remaining for answering the item. The item pool consisted of 221 items calibrated with the three-parameter logistic model. More details about its construction, calibration, checks of assumptions, and parameter distributions can be found in Olea, Ponsoda, Revuelta, and Belchí (1996) and Ponsoda, Wise, Olea, and Revuelta (1997).

The CAT algorithm selects items according to the maximum information principle (Ponsoda, Olea, & Revuelta, 1994). After each answer, the algorithm estimates the provisional ability level of the examinee by conditional maximum likelihood. The entry point of the CAT is an ability chosen at random between −.40 and .40. The test assumes that the examinee has given a correct answer to an easy item (b = −4) and an incorrect answer to a difficult one (b = 4), even though these two items have not actually been applied; these artificial data permit finite maximum likelihood estimates when all actual responses are correct or incorrect. To avoid extreme ability estimates, the algorithm implements the solution proposed by Revuelta and Ponsoda (1997). The CAT uses the progressive method for controlling item exposure (see Revuelta & Ponsoda, 1998), which involves adding a random component to item selection in the initial stages of the test. The only stopping rule is a test length of 20 items. The ECAT is a CAT in which items are selected for an ability level equal to the estimated ability minus 0.5; in all other aspects, it is identical to the CAT.

In a previous study (Olea et al., 1996), the ability distribution in the population of psychology students was found to be normal with mean 0.57 and standard deviation 0.92. To create the CFIT, 20 random values were sampled from this distribution, and the CFIT consisted of the 20 most informative items for the sampled abilities, presented in increasing order of difficulty. These items were the same for all individuals who received the CFIT. The procedure for developing the CFIT can be understood as a simulation approach that maximizes information for the population of examinees, instead of maximizing information for specific ability values, as is the case in adaptive tests.
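To make the selection and scoring machinery concrete, here is a minimal Python sketch of maximum-information item selection and maximum likelihood ability estimation under the three-parameter logistic model. It is our illustration, not the operational code: the D = 1.7 scaling constant, the grid-search estimator, and all names are assumptions, and the progressive exposure-control method and the safeguard against extreme estimates mentioned above are omitted.

```python
import numpy as np

D = 1.7  # logistic scaling constant (an assumption; the paper does not state it)

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability level theta."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def next_item(theta_hat, pool, administered, offset=0.0):
    """Maximum-information selection; offset = -0.5 reproduces the ECAT rule."""
    info = np.array([info_3pl(theta_hat + offset, *item) for item in pool])
    info[list(administered)] = -np.inf  # never administer an item twice
    return int(np.argmax(info))

def ml_theta(items, responses, grid=np.linspace(-4, 4, 161)):
    """Grid-search ML ability estimate. The two artificial anchor responses
    (a success at b = -4 and a failure at b = +4) keep the estimate finite
    when all real responses are correct or all are incorrect."""
    anchors = [((1.0, -4.0, 0.0), 1), ((1.0, 4.0, 0.0), 0)]
    loglik = np.zeros_like(grid)
    for (a, b, c), u in list(zip(items, responses)) + anchors:
        p = p_3pl(grid, a, b, c)
        loglik += np.log(p) if u else np.log(1.0 - p)
    return float(grid[np.argmax(loglik)])
```

An adaptive session would alternate ml_theta and next_item for 20 items, starting from an entry ability drawn at random from (−.40, .40).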
Figure 1. Information functions for the CAT, ECAT, and CFIT (ability on the horizontal axis, test information on the vertical axis).
Information for the three tests was computed for values of θ between −4 and 4, in steps of 0.1. For the adaptive tests, information was computed by selecting the 20 most informative items for θ (in the CAT) and for θ − 0.5 (in the ECAT). For the three tests, maximum information was attained for values of θ between 0 and 1.

Computerized anxiety measures. State anxiety of participants was measured with the Spanish version of the State-Anxiety Scale (SAS) from Spielberger, Gorsuch, and Lushene (1970). The 20 items of the scale were split into two equivalent parts: one to be applied before the vocabulary test (SAS1) and the other to be applied after the vocabulary test to measure posttest state anxiety (SAS2). The equivalence and factor validity of both parts were studied in Ponsoda et al. (1999). Anxiety increase (AI = SAS2 − SAS1) is the variable entered in the analyses.

Comfort of the examinees during the test. A computerized questionnaire was used to assess examinees' opinions about the vocabulary test and the importance of item review. Four items were administered to assess the importance of review: Item C1, "Review is very important in this kind of test"; Item C2, "The way review was allowed is appropriate"; Item C3, "Review has helped to improve my score in the test"; and Item C4, "Review has helped to make me feel better." Responses were recorded on a five-point Likert-type scale (1 = total disagreement, 2 = disagreement, 3 = neutral, 4 = agreement, 5 = total agreement). Participants in the no-review condition were presented only with Item C1.
Procedure

The testing sessions took place in two computer rooms, one with 30 seats and the other with 20. Students were randomly assigned to conditions, and conditions to computers, for each round of data collection. Once the students were seated in front of the computer, a researcher gave brief instructions about the procedure and informed the participants that they would receive further instructions as necessary via the computer. The testing session for each examinee was as follows: (a) recording of name and identity number; (b) general instructions for the SAS1, three examples, and the 10 items of the SAS1 pretest; (c) instructions and application of the English vocabulary test; (d) application of the SAS2; and (e) application of the items measuring comfort. The computer gave instructions for the English vocabulary test that varied across the review conditions. The NR condition had the following instructions:
The purpose of the following test is to find out your English vocabulary level. Each question asks you to translate an English word into Spanish. For each word you will be given five alternatives. You have to select the correct one. You will be given 15 seconds to answer each question. Think carefully before you answer, because once you have given an answer you will not be allowed to change it. Before the test begins you will see three examples.
The instructions for the English vocabulary test in the RE condition were identical except for the last three sentences:
You will be given 15 seconds to answer each question. If you wish to change any of your answers you will be allowed to do so at the end of the test. Before the test begins you will see three examples.
Data Analysis

Analysis of answer change was carried out by tabulating percentages of (a) examinees that changed answers; (b) examinees whose ability estimates
improved, decreased, or remained the same after review; (c) total answers that were changed; and (d) answers changed from wrong to wrong, from wrong to right, and from right to wrong. Dependent variables obtained from the vocabulary test were ability estimate, standard error, number of correct responses, and testing time. These were analyzed together with comfort (C1 to C4) in relation to the factors test type and review condition. The effects of test type, review, and occasion on AI were also analyzed. The statistical method was analysis of variance, run with the general linear model routine of SPSS. Multiple comparisons between means were also calculated for main effects using the Tukey multiple comparison test. The Pearson correlation between ability and AI was computed in the CFIT condition.
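For readers who want to reproduce this kind of analysis outside SPSS, the following sketch runs the between-subjects part of the model in Python with statsmodels (the data frame and column names are hypothetical; the within-subjects occasion factor would need a repeated-measures formulation, omitted here for brevity):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# df: one row per examinee, with columns test ('CAT'/'ECAT'/'CFIT'),
# review ('NR'/'RE'/'RB'/'RI'), and a dependent variable, e.g. time.
df = pd.read_csv("vocabulary_test.csv")  # hypothetical file

# Two-way between-subjects ANOVA: main effects and interaction
model = smf.ols("time ~ C(test) * C(review)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey multiple comparisons for the main effect of review
print(pairwise_tukeyhsd(df["time"], df["review"]))
```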
Results
Answer Change Behavior Within the Review Conditions

Of the 415 examinees who were allowed to review, 373 (89.9%) chose to do so. Table 1 shows the consequences of review, expressed as percentages of examinees changing their responses and percentages of responses changed in the different directions. Table 1 shows, in the row labeled Review, that almost 90% of examinees chose to review. The Improve row shows that some 65% of them benefited from doing so and improved their ability estimate. The percentage of participants whose ability estimate decreased because of review is around 22% in the CAT and CFIT and 16% in the ECAT; this difference is due to the lower difficulty level of the ECAT. Differences were found among the review conditions in the percentage of participants who reviewed any item, with percentages of 79.7, 87.8, and 91.3 for the conditions RE, RB, and RI, respectively. The row labeled Answer Change shows the percentage of responses changed after review. These percentages are above 10% for most conditions, and the percentage of changes is highest in the CFIT. Among the reviewed responses, around 40% were changed from wrong to right in the adaptive conditions, and only about 30% in the CFIT; the mismatch between ability and difficulty in the CFIT prevents individuals from taking advantage of review to improve performance.

Effects of Test Type, Item Review, and Occasion on Psychometric Variables

Descriptive statistics for the psychometric variables grouped by experimental condition are shown in Table 2. The trend is an improvement in estimated ability and an increase in standard error and number of correct responses after item review.
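As a sketch of how the percentages in Table 1 (below) could be tabulated from a response-level log, consider the following pandas fragment (file and column names are hypothetical, not from the original study):

```python
import pandas as pd

# log: one row per reviewed response, with columns test, review_cond, and
# change in {"none", "wrong_to_wrong", "wrong_to_right", "right_to_wrong"}
log = pd.read_csv("review_log.csv")  # hypothetical file

changed = log[log["change"] != "none"]

# Percentage of responses changed, by test type and review condition
pct_changed = (100 * changed.groupby(["test", "review_cond"]).size()
               / log.groupby(["test", "review_cond"]).size())

# Direction of change among changed responses (rows sum to 100)
direction = pd.crosstab([changed["test"], changed["review_cond"]],
                        changed["change"], normalize="index") * 100

print(pct_changed, direction, sep="\n\n")
```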
Table 1
Review and Answer Change Pattern by Test Type and Review Condition

                       CAT                            ECAT                           CFIT
               RE     RB     RI     All       RE     RB     RI     All       RE     RB     RI     All     Total
Review        80.0   84.8   95.7   86.9     76.1   87.2   89.4   84.3     83.0   91.3   88.9   87.7     89.9
Improve       66.7   73.2   57.8   65.9     65.7   63.6   61.9   63.73    59.0   63.0   71.1   64.37    64.6
Remain same    8.3    7.3   20.0   11.87    28.6   15.9   16.7   20.4     20.5    8.7    6.7   11.97    14.5
Get worse     25.0   19.5   22.2   22.23     5.7   20.5   21.4   15.87    20.5   28.3   22.2   23.67    20.9
Answer change 12.55  11.74  15.22  13.17     6.09  10.75  10.00   8.95    14.37  19.13  14.67  16.06    12.71
Wrong → Wrong 42.47  44.46  51.44  46.12    42.86  48.51  52.13  47.83    54.07  60.79  61.35  58.74    52.24
Wrong → Right 43.35  44.46  38.57  42.13    53.53  34.64  40.43  42.87    31.11  30.68  34.83  32.21    37.53
Right → Wrong 14.18  11.07   9.99  11.75     3.61  16.85   7.45   9.30    14.82   8.52   3.82   9.05    10.23

Note. Review contains the percentage of participants who decided to review any answer in each condition. Answer change contains the percentage of responses changed; the Wrong → Wrong, Wrong → Right, and Right → Wrong rows break down the changed responses by direction. All refers to the three review conditions pooled.
Table 2
Means and Standard Deviations for Dependent Variables by Test Type, Review Condition, and Occasion

                  Estimated Ability          Standard Error          Correct Responses              Testing Time
          n      Before       After        Before      After       Before         After           Before         After
CAT  NR   46   .59 (.66)       —         .24 (.02)      —        12.78 (1.44)      —            190.65 (75)       —
     RE   45   .53 (.68)   .64 (.71)    .24 (.02)   .25 (.04)   13.02 (1.48)  13.76 (1.94)     195.22 (63)   295.11 (120)
     RB   46   .48 (.70)   .57 (.72)    .24 (.02)   .24 (.02)   11.78 (1.73)  12.56 (1.45)     192.17 (48)   255.37 (114)
     RI   46   .50 (.71)   .62 (.73)    .25 (.03)   .25 (.03)   12.63 (2.26)  13.50 (1.41)     212.39 (53)   215.66 (56)
ECAT NR   46   .58 (.71)       —         .26 (.02)      —        15.63 (.97)       —            181.87 (75)       —
     RE   46   .63 (.70)   .76 (.72)    .26 (.02)   .29 (.05)   15.87 (1.15)  16.48 (1.33)     157.91 (55)   230.50 (113)
     RB   47   .83 (.65)   .88 (.67)    .27 (.02)   .28 (.04)   15.30 (1.85)  15.68 (1.59)     161.02 (58)   199.79 (101)
     RI   47   .43 (.61)   .53 (.62)    .26 (.02)   .26 (.03)   14.85 (1.63)  15.51 (1.36)     184.68 (53)   184.68 (53)
CFIT NR   44   .81 (.80)       —         .32 (.04)      —        11.20 (3.57)      —            189.20 (58)       —
     RE   47   .86 (.72)   .97 (.76)    .32 (.06)   .31 (.05)   11.49 (3.14)  11.96 (3.40)     191.66 (61)   295.40 (119)
     RB   46   .63 (.71)   .82 (.63)    .33 (.09)   .27 (.04)   10.50 (3.18)  11.35 (2.94)     202.72 (62)   263.00 (126)
     RI   45   .87 (.62)  1.06 (.62)    .30 (.02)   .27 (.03)   11.55 (2.92)  12.47 (2.86)     211.16 (59)   211.16 (58)
Total    551   .64 (.69)   .76 (.70)    .27 (.05)   .28 (.04)   13.01 (2.88)  13.70 (2.73)     189.74 (59)   238.80 (106)

Note. n is group sample size. Cells contain means with standard deviations in parentheses. Before and After refer to occasion; the NR groups have no After values because no review took place.
Table 3
ANOVA Results for Estimated Ability, Standard Error, Correct Responses, and Testing Time by Test Type, Review Condition, and Occasion

                          Estimated Ability     Standard Error        Correct Responses     Testing Time
Test type                 p < .01, η² = .04    p < .01, η² = .38     p < .01, η² = .39     p < .01, η² = .07
Review condition          p = .74, η² = .00    p = .21, η² = .01     p < .01, η² = .03     p < .01, η² = .02
Occasion                  p < .01, η² = .23    p = .47, η² = .00     p < .01, η² = .25     p < .01, η² = .31
Test × Review             p < .04, η² = .02    p = .39, η² = .01     p = .12, η² = .02     p = .31, η² = .01
Test × Occasion           p < .01, η² = .02    p < .01, η² = .04     p = .20, η² = .01     p < .05, η² = .02
Review × Occasion         p = .73, η² = .00    p < .01, η² = .02     p = .33, η² = .00     p < .01, η² = .26
Test × Review × Occasion  p = .22, η² = .01    p = .27, η² = .01     p = .53, η² = .01     p = .32, η² = .01

Note. The cells contain the results of significance tests of the ANOVA F ratios: p is the significance level, and η² is the proportion of explained variance.
Results show no clear differences between the review conditions. As expected, the number of correct responses is higher in the ECAT than in the CAT, though estimated ability and standard error are similar. The CFIT is the most difficult test, as indicated by the lower mean number of correct responses; it also produced the highest ability estimates. Results of the analysis of variance for the effects of review condition, test type, and occasion appear in Table 3.

Number of correct responses. The main effects of the three independent variables were statistically significant at the .01 level. With regard to test type, the highest mean appears in the ECAT condition, whereas the lowest corresponds to the CFIT. As for the effect of review, according to the Tukey test for multiple comparisons, the mean in the RB condition is lower than in the RE condition. The effect of occasion is that correct responses increase after review.

Estimation of ability. Differences in number of correct responses do not necessarily carry over to estimated ability. Test type had a statistically significant effect on the ability estimate, though the effect size was extremely small. According to the Tukey test, the highest estimated ability occurs in the CFIT, as compared to the CAT (p < .01) and the ECAT (p < .05); no statistically significant difference was found between the CAT and the ECAT. With regard to occasion, estimated ability was significantly higher after review, though the type of review had no statistically significant effect.

Standard error. All of the differences among types of test were statistically significant. The smallest standard error occurs in the CAT, as compared to the
ECAT and the CFIT. The ECAT also produced a smaller standard error than the CFIT. Item review did not affect the standard error.

Testing time. The ANOVA results for testing time indicate that the main effects were statistically significant for test type, review condition, and occasion. Multiple comparisons revealed that testing time was significantly shorter in the ECAT, as compared to the CAT and the CFIT; no difference was found between the CAT and the CFIT. Testing time was significantly longer in the RE condition, as compared to RI. With regard to the occasion effect, response time was shorter the first time the examinee saw the items than during review, though there were differences among conditions. The ratios of mean testing time in the conditions RE, RB, and RI, as compared to NR, are 1.46, 1.27, and 1.09. These data provide an indication of the increase in testing time that can be expected under the different conditions (see Table 2). However, they may be subject to the particular features of the test, in particular the simplicity of the items and the restriction of answer time both in the first item application and in review. Item difficulty is correlated with testing time in the CFIT condition: the first items (the least difficult) took much less time than the final ones, where examinees took 150% more time to answer.

Figure 2 presents the box plot of response time by review condition and occasion. Response time after review is the total time, summing the time of the two applications of the items. The pattern in Figure 2 shows that response time during review was short in the RI condition and did not contribute to increasing total testing time. Total test time increased with the size of the blocks of items to be reviewed. This result may be due to the increase in the time interval between the two applications of each item, which has an adverse effect on memory.

Effects of Test Type, Item Review, and Occasion on Psychological Variables

State anxiety. The overall mean and standard deviation of the SAS1 are 19.45 and 4.87, with a Spearman-Brown reliability of .843; for the SAS2, these statistics are 19.40, 4.26, and .791, respectively. Given that scores on the anxiety scales can only take values between 10 and 40, these results suggest that anxiety was low both before and after the test. Table 4 contains descriptive statistics and analysis of variance results for AI grouped by experimental condition. F ratios indicate that test type and review condition have no statistically significant effects. The lack of effect on AI is unexpected under both the perceived performance and the perceived control hypotheses.
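The half-scale reliabilities just reported are presumably split-half coefficients stepped up with the Spearman-Brown correction; the paper does not show the computation, so the following is only the standard formula for reference:

$$
\rho^{*} \;=\; \frac{2\,\rho_{12}}{1 + \rho_{12}},
$$

where $\rho_{12}$ is the correlation between the two halves of a scale and $\rho^{*}$ is the reliability projected for the full-length scale.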
Figure 2. Box plots of response time (mean time) by review condition (NR, RE, RB, RI) and occasion (before vs. after).
Table 4
Descriptive Statistics and ANOVA Results for Anxiety Increase by Test Type and Item Review Condition

            CAT            ECAT           CFIT
NR       0.35 (4.52)   -0.26 (3.35)    0.68 (2.88)
RE      -1.64 (4.02)   -0.26 (3.76)   -0.21 (3.62)
RB       0.09 (3.64)    0.19 (3.53)    0.56 (4.05)
RI      -0.24 (3.79)   -0.64 (3.66)    0.95 (4.26)
Total   -0.36 (4.05)   -0.24 (3.52)    0.49 (3.74)

Statistical tests: test type, p = .06, η² = .01; review condition, p = .10, η² = .01; Test × Review, p = .54, η² = .01.

Note. The cells contain means with standard deviations in parentheses of anxiety increase within the review conditions. The statistical tests refer to the results of significance tests of the ANOVA F ratios.
Two additional predictions of the perceived performance hypothesis were not supported by the data. First, AI did not correlate with ability change, either in the complete sample (r = .045, p = .706) or under the different review conditions. Second, AI and estimated ability did not correlate in the fixed-test condition, either in the complete sample (r = .049, p = .430) or under the different review conditions. It makes no sense to correlate ability with AI in the adaptive conditions, because item difficulty is targeted to ability, which implies that participants at all ability levels perceive a similar performance: around a 50% chance of success.

Comfort of examinees during the test. Four items (C1 to C4) evaluated different aspects of examinees' comfort and satisfaction with the testing
conditions. Descriptive statistics and ANOVA results were computed for these items. There were no effects of either test type or review condition on these variables, with one exception: review condition had a statistically significant effect on Item C1, though the effect size was very small. Tukey's multiple comparison test revealed that participants in the no-review condition attached less importance to the opportunity to review than participants in the review conditions.
Discussion
This study aimed to determine the optimal conditions for the application of computerized testing. The following objectives were addressed: (a) to study the psychometric effects of three test conditions (CAT, ECAT, and CFIT) and four item review conditions (no review, review at end, review by blocks of five items, and review item-by-item) on estimated ability and standard error; (b) to determine the test application conditions that produce minimal anxiety in the examinees; (c) to study the effects of the different conditions on examinees' perceived importance of item review and comfort with the testing experience; and (d) to study the effects of the different conditions on testing time.

With regard to the first objective, rates of responses changed after review were higher than in previous studies (Vispoel, 2000), this difference being due to the match between difficulty and ability. A higher rate of change was found in the CFIT condition, owing to the inclusion of difficult items; in contrast, fewer changes were made in the ECAT condition. The CFIT condition produced higher estimated ability and standard error than the CAT and ECAT conditions. The difference in standard error is an expected result, given the differences in the information functions of the three tests. However, the tests should not produce differences in estimated ability; additional research is necessary to investigate possible biases in ability estimation due to the item selection algorithm.

No statistically significant differences in ability level and standard error were found among the review conditions. However, after review, participants significantly increased their number of correct responses and improved their estimated ability without losing precision in the estimation, regardless of the type of review permitted. Contradictory results regarding the effects of review on mean estimated ability may be related to the administration format of the items (see Lunz & Bergstrom, 1994; Lunz et al., 1992; Olea et al., 2000; Stone & Lunz, 1994; Vispoel, 1998, 2000; Vispoel et al., 1999, 2000). The improvement in the number of correct responses may be explained in part by the characteristics of the study. Because there was a time limit of 15 seconds to answer each item, participants may have benefited by using review to read each question more carefully and to detect and change incorrect answers. The high rates of examinees who decided to review (60%) and of
responses changed in review (80%, as compared to the 60% average found in previous studies) may be due in part to a lack of time for processing the items in the first run. In contrast, if a test does not include a strict time limit for responding to each item, no statistically significant changes in difficulty should be expected between the two applications of the items. It is important to note that the item pool was calibrated with the 15-second time limit per item. Administering an item a second time changes the conditions of item calibration and constitutes a threat to the accuracy of the item parameter estimates, so that the difficulty of the item may decrease. Given that item parameters are taken to be the same in the first administration as in review, the consequence must be an improvement in ability estimated after review. More complex psychometric models, including an explicit parameterization for a test-retest data collection design, should be used to address this problem. At the present stage of research, these results provide an outline of the consequences of review under specific application conditions.

With regard to the second objective, no effects on anxiety were found due to test type, review condition, or their interaction. Anxiety increase was related neither to ability, as expected from the perceived performance hypothesis, nor to review condition, as expected from the perceived control hypothesis. Thus, the study did not contribute to clarifying the validity of the two theories of anxiety underlying the design of the test application scenario. The lack of effects on anxiety may be due to low statistical power and to a floor effect caused by the low-stakes nature of the test. It is clear that the instructions to examinees did not succeed in creating a medium-stakes condition, and results cannot be generalized to more stringent conditions in which test results have important consequences for examinees. Moreover, in a previous study with a smaller set of conditions (Olea et al., 2000), statistical differences in anxiety appeared between the fixed and the adaptive tests. The sample used in the present study may have been too small to detect differences among the many conditions considered.

As far as the third objective is concerned, examinees who were allowed to review conferred more importance on the opportunity to review than examinees who were not allowed to do so. Interestingly, participants are pessimistic, in the sense that, although most do improve their ability estimate when reviewing, they later indicate that their score has not changed. Moreover, those who decided not to review indicated that their score would have been poorer if they had reviewed. These results suggest a divergence between perceived and actual performance. If the perceived performance hypothesis is correct, it is important to provide appropriate feedback to compensate for participants' a priori expectations. Participants in all the review conditions and test types feel that review has contributed to increasing comfort with the testing experience, even though they do not believe that their scores improve.
Increased control over the test provided by review may be posited to explain these findings.

Finally, concerning the fourth objective, overall results indicate that review increases testing time by around 26%. The increase is greater in the CAT and CFIT than in the ECAT, which may be due to the easiness of the ECAT. The review condition in which testing time increases most is review at end, with an increase of 51%. This is because, although in the other conditions participants do not necessarily have to review all the items, in the review-at-end condition participants who decide to review are administered the complete test in the second phase.

To sum up, this study included a large number of testing and review conditions designed to test two hypotheses about state anxiety. The results replicate previous findings but incorporate the new finding of a noteworthy improvement in ability estimates after review, with no loss of precision. The improvement in ability estimates may be due to distortion of the psychometric properties of the item bank in the review phase. No effect of review on anxiety was found, though examinees did appreciate the opportunity for review. Two additional lines of research should be established on the basis of these findings: first, applications in high-stakes settings are necessary to obtain additional evidence about effects on anxiety; second, there is a need for item response models for repeated-measures designs that can effectively deal with the modification of item difficulty.
References
Bergstrom, B., & Lunz, M. (1999). CAT for certification and licensure. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 67-91). Mahwah, NJ: Lawrence Erlbaum.

Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of computerized educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 367-407). New York: Macmillan.

Gershon, R. C., & Bergstrom, B. (1995, April). Does cheating on CAT pay: Not. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Lunz, M. A., & Bergstrom, B. A. (1994). An empirical study of computerized adaptive test administration conditions. Journal of Educational Measurement, 31, 251-263.

Lunz, M. A., Bergstrom, B. A., & Wright, B. D. (1992). The effect of review on student ability and test efficiency for computerized adaptive tests. Applied Psychological Measurement, 16, 33-40.

Olea, J., Ponsoda, V., Revuelta, J., & Belchí, J. (1996). Propiedades psicométricas de un test adaptativo de vocabulario inglés [Psychometric properties of a CAT of English vocabulary]. Estudios de Psicología, 55, 61-73.

Olea, J., Revuelta, J., Ximénez, M. C., & Abad, F. J. (2000). Psychometric and psychological effects of review on computerized fixed and adaptive tests. Psicológica, 21, 157-173.

Ponsoda, V., Olea, J., & Revuelta, J. (1994). ADTEST: A computer adaptive test based on the maximum information principle. Educational and Psychological Measurement, 54(3), 680-686.

Ponsoda, V., Olea, J., Rodríguez, M. S., & Revuelta, J. (1999). The effects of test difficulty manipulation in computerized adaptive testing and self-adapted testing. Applied Measurement in Education, 12, 167-184.

Ponsoda, V., Wise, S., Olea, J., & Revuelta, J. (1997). An investigation of self-adapted testing in a Spanish high school population. Educational and Psychological Measurement, 57, 210-221.

Revuelta, J., & Ponsoda, V. (1997). Una solución a la estimación inicial en los tests adaptativos informatizados [A solution to initial estimation in computerized adaptive tests]. Revista Electrónica de Metodología Aplicada, 2, 1-6.

Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 35, 311-327.

Spielberger, C. D., Gorsuch, R. L., & Lushene, R. E. (1970). Manual for the State-Trait Anxiety Inventory. Palo Alto, CA: Consulting Psychologists Press. (Spanish adaptation by TEA Ediciones, 1988, 3rd ed.)

Stocking, M. L. (1997). Revising item responses in computerized adaptive tests: A comparison of three models. Applied Psychological Measurement, 21, 129-142.

Stone, G. E., & Lunz, M. E. (1994). The effect of review on the psychometric characteristics of computerized adaptive tests. Applied Measurement in Education, 7, 211-222.

Sykes, R. C., & Ito, K. (1997). The effects of computer administration on scores and item parameter estimates of an IRT-based licensure examination. Applied Psychological Measurement, 21, 51-63.

Vispoel, W. P. (1998). Reviewing and changing answers on computer-adaptive and self-adaptive vocabulary tests. Journal of Educational Measurement, 35, 328-345.

Vispoel, W. P. (2000). Reviewing and changing answers on computerized fixed-item vocabulary tests. Educational and Psychological Measurement, 60, 371-384.

Vispoel, W. P., Hendrickson, A. B., & Bleiler, T. (2000). Limiting answer review and change on computer adaptive vocabulary tests: Psychometric and attitudinal results. Journal of Educational Measurement, 37, 21-38.

Vispoel, W. P., Rocklin, T. R., Wang, T., & Bleiler, T. (1999). Can examinees use a review option to obtain positively biased ability estimates on a computerized adaptive test? Journal of Educational Measurement, 36, 141-157.

Vispoel, W. P., Wang, T., de la Torre, R., Bleiler, T., & Dings, J. (1992, April). How review options, administration mode and anxiety influence scores on computerized vocabulary tests. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco. (ERIC Document Reproduction Service No. TM018547)

Waddell, D. L., & Blankenship, J. C. (1995). Answer changing: A meta-analysis of the prevalence and patterns. Journal of Continuing Education in Nursing, 25, 155-158.

Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12, 15-20.

Wise, S. L. (1994). Understanding self-adapted testing: The perceived control hypothesis. Applied Measurement in Education, 7(1), 15-24.

Wise, S. L. (1995, August). Item review and answer changing in computerized adaptive tests. Paper presented at the Third European Conference on Psychological Assessment, Trier, Germany.

Wise, S. L., Freeman, S. A., Finney, S. J., Enders, C. K., & Severance, D. D. (1997, April). The accuracy of examinee judgements of relative item difficulty: Implications for computerized adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.