
NAEP Scoring

NAEP assessments include multiple-choice items, which are machine-scored by optical mark reflex
scanning, and constructed-response items, which are scored by trained scoring staff. These trained
scorers ("raters") use an image-based scoring system that routes student responses directly to each
rater. Focused, explicit scoring guides are developed to match the criteria emphasized in the assessment
frameworks. Consistency of scoring between raters is monitored during the process through ongoing
reliability checks and frequent backreading.

Throughout the scoring process, three types of personnel make up individual scoring teams:

• Raters are professional scorers who are hired to rate individual student responses to the constructed-response items.
• Supervisors lead teams of raters throughout the scoring process on a daily basis.
• Trainers provide training for the scoring raters, continually monitoring the progress of each
scoring team.

Team members are required to have, at a minimum, a baccalaureate degree from a four-year college or
university. An advanced degree, scoring experience, and/or teaching experience is preferred. Scoring
teams use the training process to determine whether each individual rater is sufficiently prepared to
score. Following training, each rater is given a pre-scored "qualification set" and is expected to attain at least 80 percent correct in order to proceed.
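
As a rough illustration of this qualification gate, the following minimal Python sketch (the function and variable names are hypothetical, not part of any NAEP system) checks a rater's agreement with the pre-scored key against the 80 percent criterion.

    def passes_qualification(rater_scores, key_scores, threshold=0.80):
        """Return True if the rater matches the pre-scored key on at least 80 percent of responses."""
        # Assumes one rater score and one key score per qualification response.
        if len(rater_scores) != len(key_scores) or not key_scores:
            raise ValueError("rater_scores and key_scores must be non-empty and the same length")
        exact = sum(1 for r, k in zip(rater_scores, key_scores) if r == k)
        return exact / len(key_scores) >= threshold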

All scoring is carried out via image processing. To assign a score, raters click a button displayed in a scoring window. Because buttons are included only for valid scores, out-of-range scores cannot be entered. Two significant advantages of the image-scoring system are the ease of regulating the flow of work to raters and the ease of monitoring scoring. The image system provides scoring supervisors with tools to determine rater qualification, to backread raters, to determine rater calibration, to reset trend rescore items, to monitor trend rescore items through t-statistic reports, to monitor interrater reliability, and to gauge the rate at which scoring is being completed.

The scoring supervisors monitor work flow for each item using a status tool that displays the number of
responses scored, the number of responses first-scored that still need to be second-scored, the number
of responses remaining to be first-scored, and the total number of responses remaining to be scored. This
allows the scoring directors and project leads to accurately monitor the rate of scoring and to estimate the
time needed for completion of the various phases of scoring.
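
The quantities tracked by the status tool can be pictured as a small record per item. The sketch below is purely illustrative (hypothetical field names, Python); in particular, treating the total remaining as the sum of responses awaiting a first and a second score is an assumption, not a documented definition.

    from dataclasses import dataclass

    @dataclass
    class ItemScoringStatus:
        scored: int           # responses with all required scores assigned
        awaiting_second: int  # first-scored responses still needing a second score
        awaiting_first: int   # responses not yet first-scored

        @property
        def remaining(self) -> int:
            # Assumed definition of "total remaining to be scored".
            return self.awaiting_first + self.awaiting_second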

Backreading

After scoring begins, NAEP scoring supervisors review each rater's progress using a backreading utility
that allows the scoring supervisor to review papers scored by each rater on the team. Scoring supervisors
make certain to note the score the rater awards each response as well as the score a second rater gives
that same paper. This is done as an interrater reliability check. Typically, a scoring supervisor reviews
approximately 10 percent of all item responses scored by each rater.

Alternatively, a scoring supervisor can choose to review all responses given a particular score to
determine if the team as a whole is scoring consistently. Both of these review methods use the same
display screen, showing the ID number of the rater and the scores awarded. If the scoring supervisor
disagrees with the score given an item, he or she discusses it with the rater for possible correction.
Whether or not the scoring supervisor agrees with the score, he or she assigns a scoring supervisor score
in backreading. If this score agrees with the first score, the score is recorded only for statistical purposes.
If the scores disagree, the scoring supervisor score overrides the first score as the reported score.
Replacement of scores by the scoring supervisor is done only with the knowledge and approval of the
rater, thereby serving as a learning experience for the rater. Changing the score does not change the
measurement of interrater reliability.
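
The backreading decision described above can be summarized in a short sketch (hypothetical names, Python): the scoring supervisor's score is always recorded, but it replaces the reported score only when the two scores disagree.

    def record_backread_score(first_score, supervisor_score):
        """Return (reported_score, overridden) for a backread response."""
        if supervisor_score == first_score:
            # Agreement: the supervisor score is kept for statistical purposes only.
            return first_score, False
        # Disagreement: the supervisor score overrides the reported score
        # (per the text, only with the rater's knowledge and approval).
        return supervisor_score, True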

Calibration of Scoring Raters

For new assessment items, the scoring supervisor of each team invokes calibration as needed from the
tool used in backreading. During backreading, the scoring supervisor has a pool of 300 responses for
each item to use in the calibration process. The scoring supervisor views samples of these responses
together with the scores assigned by the first and, if applicable, the second rater. From these 300 responses, the scoring supervisor selects responses that have been scored correctly and serve as a good measure to keep scoring on track, placing them into a calibration pool. From that calibration pool, the supervisor builds sets with the desired number of responses, usually between five and ten. These sets are then released on the image-based
scoring system for raters to score.

When raters invoke the calibration window, all raters receive the same responses and score them. After
all raters have finished scoring this pool, the scoring supervisor can look at reliability reports, which
include only the data from the calibration set just run. This process serves to refresh training and avoid
drift in scoring. If pre-scored paper calibration sets already exist, these can be used to calibrate raters
instead of the image-based sets created by the scoring supervisor.

In general, each team scores calibration sets whenever they take a break longer than fifteen minutes,
such as when returning from lunch or at the beginning of a shift. Raters scoring trend rescore items are
calibrated using trend rescore responses that are already loaded into the system with the scores given
during prior year scoring.
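
A minimal sketch of assembling a calibration set is shown below (hypothetical names, Python). The actual selection is a judgment call by the scoring supervisor; the sketch only illustrates drawing a set of the desired size from responses already flagged as correctly scored.

    def build_calibration_set(calibration_pool, set_size=10):
        """Assemble one calibration set (usually five to ten responses) from the supervisor's pool."""
        correctly_scored = [r for r in calibration_pool if r["scored_correctly"]]
        return correctly_scored[:set_size]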

t Statistics

A goal in scoring is consistency in scores that are assigned to the same responses by different raters
within the same year or by different raters across different assessment years. Statistical flags are used to
identify items for which scoring is not consistent. A system has been implemented that calculates t statistics comparing the scores assigned to each item response that has been rescored at different points in the scoring process.

To calculate a t statistic, the scoring supervisor executes a command in the report window. The scoring
supervisor is then prompted for the item, the application (or purpose for which the t statistic is being
performed), and the scoring group to which the item is assigned. The system then displays the results,
which are printable. The test results are based only on responses for which there are two scores and for
which both scores are on task. The display shows number of scores compared, number of scores with
exact agreement, percent of scores with exact agreement, mean of the scores assigned during the
scoring process for previous assessment years, mean of the currently assigned scores, the mean
difference, variance of the mean difference, standard error of the mean difference, and the estimate of the
t-statistic. The formulas used are as follows:

• Dbar = Mean Score 2 – Mean Score 1, where

  Dbar is the mean difference,
  Mean Score 1 is the mean of all scores assigned by the first rater, and
  Mean Score 2 is the mean of all scores assigned by the second rater.

• DiffDbarsq = ((Score 2 – Score 1) – Dbar)², where

  DiffDbarsq is calculated for each score comparison.

• VarDbar = (sum(DiffDbarsq))/(N – 1), where

  VarDbar is the variance of the mean difference.

• SEDbar = SQRT(VarDbar/N), where

  SEDbar is the standard error of the mean difference, and
  N is the number of responses with two scores assigned by two different raters.

• Percent Exact Agreement = (number of responses with identical scores)/(total number of double-scored responses being compared), where

  exact agreement means identical scores assigned to a response by two different raters.

• t = Dbar/SEDbar

For purposes of calculations, the possible scores for a response to an item are ordered categories
beginning with 0 and ending with the number of categories for the item responses minus one.
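
The following Python sketch applies the formulas above to a list of (first score, second score) pairs for double-scored, on-task responses; the function name and the example data are illustrative only.

    import math

    def trend_t_statistic(pairs):
        """Compute Dbar, percent exact agreement, and t for (score 1, score 2) pairs."""
        n = len(pairs)
        if n < 2:
            raise ValueError("at least two double-scored responses are required")
        diffs = [s2 - s1 for s1, s2 in pairs]
        dbar = sum(diffs) / n                                     # Dbar: mean difference
        var_dbar = sum((d - dbar) ** 2 for d in diffs) / (n - 1)  # VarDbar: variance of the differences
        se_dbar = math.sqrt(var_dbar / n)                         # SEDbar: standard error of the mean difference
        exact = sum(1 for s1, s2 in pairs if s1 == s2)
        percent_exact = 100.0 * exact / n                         # percent exact agreement
        t = dbar / se_dbar if se_dbar > 0 else 0.0                # t = Dbar / SEDbar
        return {"n": n, "percent_exact": percent_exact, "dbar": dbar, "t": t}

    # Example with ordered-category scores starting at 0; |t| <= 1.5 is treated as acceptable.
    stats = trend_t_statistic([(2, 2), (1, 2), (0, 0), (3, 2), (2, 2), (1, 1)])
    acceptable = abs(stats["t"]) <= 1.5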

The estimate of the t statistic is acceptable if it falls within the range of -1.5 to +1.5. This range was selected because a single criterion was required for all items, regardless of the number of responses with scores being compared. As the number of responses being compared grows large, a criterion of 1.5 means that about 15 percent of the differences are judged not acceptable by the test when they should have been acceptable. If the estimate of the t statistic falls outside that range, raters are asked to stop scoring so the situation can be assessed by the trainer and scoring supervisor. Scoring resumes only after the trainer and scoring supervisor have determined a plan of action that will rectify the differences in scores. Usually, responses to the item are discussed with the raters, or raters are retrained, before scoring continues.

Training for the Scoring Raters

Training of NAEP scoring raters is conducted by subject-area specialists from Educational Testing
Service (ETS) and Pearson. All assessments are scored item-by-item so that each rater works from only
one scoring guide at a time. After scoring all available responses for an item, a team then proceeds with
training and scoring of the next item.

Training for current assessment scoring involves explaining the item and its scoring guide to the team and
discussing responses that represent the various score points in the guide. The trainer provides three or four student responses to "anchor" each score point. When review of the anchor responses is completed,
the raters score 10 to 20 pre-scored "practice papers" that represent the entire range of score points the
item could receive. The trainer then leads the team in a discussion of the practice papers to focus the
raters on the interpretation of the scoring guide. After the trainer and supervisor determine that the team
has reached consensus, the supervisor releases work on the image-based scoring system for the raters.
The raters initially gather around a PC terminal to group-score items to ensure further consensus.
Following group-scoring, raters work in pairs as a final check before beginning work individually. Once the
practice session is completed, the formal scoring process begins. During training, raters and the
supervisor keep notes of scoring decisions made by the trainer. The scoring supervisor is then
responsible for compiling these notes and ensuring that all raters are in alignment; this process is referred
to as calibration in NAEP. Teams vary greatly in the amount of time spent scoring as a group before
working individually.

Training for trend scoring differs only slightly: trend papers from prior years must be reviewed so that raters understand the scoring decisions made in those years before they commence further scoring.

Online Training

Online training is used because of the large number of raters needed to accomplish the NAEP scoring.
Training online reduces the number of trainers required. Another benefit is flexibility of location, which makes better use of scoring personnel.

This proved particularly helpful during the 2000 assessment, in which online scoring was first introduced: science items originally scheduled to be scored in Iowa City, Iowa, were actually scored in Tucson, Arizona. Educational Testing Service (ETS) and Pearson jointly agreed to
identify some items from each subject area that would be conducive to online training. ETS specialists
from each subject area identified 37 items—13 mathematics items, 4 reading items, and 20 science
items—for online training. ETS specialists in science and reading prepared keys and annotations for the
science and reading items respectively, while the mathematics scoring director from Pearson prepared
the keys and annotations for the mathematics items. Next, Pearson photocopied training sets, scanned
training sets, enhanced and cropped images that had been scanned, created instruction screens, edited
all images and annotations, and finally edited the beta CD prior to duplicating the master CD. NAEP
online training replicated the traditional NAEP training/scoring process except that the training was
actually done via CD. The speed was controlled by the individual rater. At the end of the training process,
the raters scored a qualification set (or a practice set if a qualification set was not available). These
scores were printed out to determine whether any of the raters needed additional help or could proceed
with scoring. Scoring supervisors still monitored scoring using scoring supervisor tools and consulted with
an assigned trainer or the ETS specialist if problems occurred.

Selection of Training Papers

In January of each assessment cycle, clerical staff and a professional printing company begin the process
of preparing sets of training papers by copying all sets for items that are to be replicated from prior years
for trend scoring. These papers are sorted by item and are numbered. Then a photocopy is made of each
set of papers and sent to the NAEP instrument development staff for rangefinding. The original is kept at the NAEP materials headquarters in Iowa City, Iowa, so that the sets can be compiled according to instructions from the instrument developers.

After review by each subject area's coordinator, the instrument development staff send the keys and/or
the training sets to the materials processing staff, who label them according to standard format and
reproduce the sets of papers using the original copies located at the materials headquarters. Correct
scores are written on all anchor papers, while only the scoring supervisors and trainers have keys for the
practice, calibration, and qualification sets. Trainers also keep annotations explaining the thought process
behind each score assigned. If any of these scores changes during training or scoring, the scoring
supervisor keeps notes explaining why.

For the 2000 assessment, training papers were selected for 126 trend rescore items from mathematics,
41 from reading, and 200 from science. NAEP clerical staff photocopied sample responses for new items
that had been field tested in spring of 1999. This included 37 new items from mathematics, 5 from
reading, and 46 from science. The number of sample responses photocopied for each new item ranged
from 50 to 100 depending on the difficulty of the item.

For the 2001 assessment, training papers were selected for 64 trend rescore items from geography and
79 from U.S. history. This process was also implemented for the eight writing and 38 reading field test
items. The number of sample responses photocopied for each new item ranged from 100 to 250
depending on the subject area.

Trend Scoring

To measure comparability of current-year scoring to the scoring of the same items scored in prior
assessment years, a certain number of student responses per item from prior years are retrieved from
image archives or rescanned from prior assessment year booklets and loaded into the system with their
scores from prior years as the first score. These are loaded into a separate application to keep the data
separate from current year scoring.

At staggered intervals during the scoring process, the scoring supervisor releases items from prior
assessment years for raters to score. Since prior year scores are pre-loaded as first scores, the current
year's teams are 100 percent second-scoring the prior year papers. Following scoring of trend rescore
items from prior years, scoring supervisors and trainers look at reliability reports, t-statistic reports, and
backreading to gauge consistency with prior year scoring and make adjustments in scoring where
appropriate.

The score given to each response is captured, retained, and provided for data analysis at the end of
scoring. For each item one of the following decisions is made based on these data:

• continue scoring the current year responses without changing course;


• stop scoring and retrain the current group of raters; or
• stop scoring, delete the current scores, and train a different group of raters.

For the 2000 and 2001 NAEP assessments, the initial release of trend item responses on the image-
based scoring system took place very soon after training was completed. Scoring supervisors controlled the release by asking raters to score a set amount of responses that together totaled the number required. Immediately upon completion, the scoring supervisor accessed the t-statistic report. The acceptable range for the t value was within + or - 1.5 of zero. If the t value was outside that range, raters were not allowed to begin scoring current year responses; the next group of trend responses released was usually scored successfully. Scoring of current year responses began only after a successful t-test.

These trend items were also released after every break longer than fifteen minutes (first thing in the morning and after lunch) to calibrate raters. When results indicated a problem, scoring resumed only after the trainer and scoring supervisor had determined a plan of action. This was usually accomplished by studying scored papers from prior assessment years to find trends in scoring, which helped determine what needed to be communicated to the raters before scoring could begin again.

The t-statistic was printed out at the end of every trend release. An interrater agreement (IRA) matrix was
also viewed after every trend release. The matrix was used as a tool to determine if the team was scoring
too harshly or too leniently. IRAs were required to be within + or - 7 of the trend year reliability. Trainers
and scoring supervisors had access to reliabilities for each item from prior years.
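
Taken together, these checks can be summarized in a small sketch (hypothetical names, Python); the thresholds of 1.5 for the t value and 7 for the interrater agreement (IRA) difference come from the text above.

    def trend_checks_pass(t_value, current_ira, prior_year_ira,
                          t_limit=1.5, ira_tolerance=7.0):
        """Return True if both trend-consistency checks pass for an item."""
        t_ok = abs(t_value) <= t_limit                                # t within + or - 1.5 of zero
        ira_ok = abs(current_ira - prior_year_ira) <= ira_tolerance   # IRA within + or - 7 of trend year
        return t_ok and ira_ok

If either check fails, scoring stops and the team is retrained or, in the worst case, the current scores are deleted and a different group of raters is trained, as described above.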

This "trend scoring" is not related to the long-term trend assessment. Trend scoring looks at changes over
time using main NAEP item responses (e.g., 2000 reading assessment scores for an item compared to
the 1998 reading assessment scores for that item). A separate table lists the differences between the main NAEP assessment and the long-term trend NAEP assessment.
