
October 2014

Combined Human and Computer Scoring of Student Writing
Table of Contents
Acknowledgments
Background
Rationale for the Scoring Approach
Design of the Study
    The Essays Used
    Human Scoring
    Computer/Automated Scoring
Results of Analyses
    Preliminary Analyses
    Reliability Analyses
    Decision Accuracy/Consistency and Standard Errors of Measurement
    Scorer Discrepancies (>1)
Findings, Conclusions, and Recommendations
References
Appendix A: NECAP Prompts and Rubrics
Appendix B: Scoring Study Rubrics

This document was prepared by Measured Progress, Inc. with funding from the Bill and Melinda Gates Foundation,
Grant Number OPP1037341. The content of the publication does not necessarily reflect the views of the Foundation.
© 2014 Measured Progress. All rights reserved.

Acknowledgments
This small scoring study was part of a larger project entitled Teacher Moderated Scoring Systems (TeaMSS).
The TeaMSS team at Measured Progress wishes to thank members of the leadership team of the New England
Common Assessment Program (NECAP), who allowed us to use student essays in response to released prompts
from previous years of NECAP testing. Our thanks also go to staff from the Educational Testing Service for the
automated scoring of over 4,500 essays using the e-rater® system, and to Yigal Attali, who provided invaluable
advice on the project. We also appreciate the involvement of Pearson Education, Inc., which used the Intelligent Essay
Assessor™ to score over 1,000 Grade 11 essays for the project, and the significant data analysis efforts of Dr. Lisa
Keller and the Abelian Group, LLC.
It is appropriate to acknowledge the efforts of staff in departments within Measured Progress who contributed to the
effort – particularly Scoring Services and Data and Reporting Services. Finally, our greatest appreciation is extended
to the Bill and Melinda Gates Foundation whose commitment to education and support of teachers have been
considerable. For the support, patience, and wisdom of the Gates Project Officer, Ash Vasudeva, we are especially
grateful.

Background
In this age of the Common Core State Standards, large state assessment consortia, Race to the Top programs, and
college and career readiness initiatives, there is renewed interest in performance assessments, measures involving
the demonstration of higher order skills through the production of significant products or performances. Heretofore,
in many school subjects, there has been general satisfaction with, or at least acceptance of, indirect multiple-choice
measures of isolated content and skills, an overemphasis that research has shown to have a negative impact
on instruction (Pecheone et al., 2010). Teachers focus on the kinds of knowledge emphasized in the high-stakes tests
and strive to emulate those measures in their own tests. Although challenges associated with the measurement
quality of performance assessments can be addressed in situations in which testing time and scoring logistics and
costs are of lesser concern, efficiency has all too often been valued more than authenticity and higher order skills.
Interestingly, the scenario described above has not been the case with the assessment of writing. While it can be
argued that many students are not asked to write enough in school, when the time comes to assess writing, there
has generally not been an acceptable substitute for direct measures – writing assignments or prompts calling for
writing samples such as essays, letters, reports, etc. Sometimes a testing program will include multiple-choice
items addressing isolated writing skills (grammar and mechanics), but these are often in addition to direct writing
assessments. It seems that in the area of writing, educators are not willing to assume students who can respond
successfully to multiple-choice questions dealing with specific knowledge and skills in isolation can or will
necessarily apply the knowledge and skills addressed by such items when asked to write an essay. More importantly,
however, there are other aspects of good writing – such as topic development, organization, attention to audience –
that are not so readily assessed by the multiple-choice format.
That educators embrace direct writing assessment does not mean that the problems associated with the
measurement quality of this form of performance assessment have been solved. While there are those who believe
we should think differently about reliability and validity for performance assessments than we do for multiple-
choice tests, in fact we cannot. In general terms, consistency of measurement and measuring the right construct are
no less important for either type of test, and they are accomplished the same way for both types. Any test samples
a large domain of knowledge and skills. A good sampling of the domain (a large number of “items” and a broad
representation of the knowledge and skills in the domain) is what leads to a reliable test exhibiting content validity.
Because a multiple-choice item usually taps such a small piece of a subject domain, a multiple-choice test must
contain a large number of items to achieve an acceptable level of reliability. However, the production of an essay

requires the application of many skills and the synthesis of a great deal of information; therefore far fewer tasks are
required to achieve the same level of reliability. Time constraints have led many assessment programs to demand a
single essay from students. While such a one-item test can produce far more generalizable results than a very short
multiple-choice test, there is still variability of student performance across writing prompts/tasks. Therefore, more
than one task is desirable. Dunbar, Koretz, and Hoover (1991) showed very clearly that a second prompt, or even two
more, increases reliability considerably. Beyond three, the situation is one of diminishing returns.
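The relationship between the number of prompts and reliability can be made concrete with the Spearman-Brown prophecy formula. The sketch below is illustrative only; the starting single-prompt reliability of .55 is a hypothetical value, not a figure from the studies cited.

    # Spearman-Brown prophecy formula: reliability of a test lengthened by a
    # factor of k, given the reliability r of the original one-prompt test.
    def spearman_brown(r, k):
        return k * r / (1 + (k - 1) * r)

    single_prompt_reliability = 0.55  # hypothetical value for illustration
    for k in range(1, 5):
        print(k, round(spearman_brown(single_prompt_reliability, k), 2))
    # 1 0.55, 2 0.71, 3 0.79, 4 0.83 -- the gain shrinks after the second or third prompt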
Of course, from a content validity perspective, representation of the domain to which results are to be generalized
is a concern. The domain of writing can be considered quite large and diverse, encompassing essays, letters,
reports, stories, directions, and many other forms of writing. However, the Common Core State Standards help us
here, as the writing standards, oriented toward college and career readiness, focus on essays in the expository and
argumentation modes, and the writing traits of interest in scoring are almost identical for both modes. Even where
they would logically differ, for example in “support,” they are similar in that expository essays should support a
theme or main idea while an argumentation essay should present evidence to support a position on an issue. Thus,
if one is interested in generalizing to the narrow subdomain of writing defined by the CCSS, then the number of
essays required from students can be very small.
Dunbar et al. also addressed the contribution of multiple readings of essays to the reliability of writing scores. Their
results were similar to those for the number of tasks: the more readings (scorers), the more reliable the measure, but
after three scorings, there are diminishing returns. The study described in this report investigates an approach to
scoring writing that combines the notions of more “items” and more “readings,” accomplishing each in a different
way from that of the more common scoring approaches. The approach involves the use of both human and
automated (computer) scoring, with the computer scoring different writing attributes than the humans, rather than
being “trained” to provide the best estimate of human holistic scores (linear combination of computer trait scores).
It has application to summative assessments of writing at the state level for state accountability programs or at the
district level where common writing assessments across schools or districts might be used for a variety of purposes,
including contributing to information used in the evaluation of teaching effectiveness.

Rationale for the Scoring Approach


Holistic scoring is commonly used for summative assessments of writing involving large numbers of students. By
this method, two readers each independently assign a single score, often from 1 to 6, to a student essay. Discrepant
scores, scores that differ by more than one point, are typically arbitrated by a third reader. Automated computer
scoring is also used in some writing assessment programs. It is intended to reduce scoring time and expense,
these economies being more significant when the number of students responding to a prompt is greater. Probably
because there is not absolute faith in computer scoring yet, there are some who would argue that computer scoring
can take the place of second human readers, but should not be relied on solely for student scores. In fact, there is
some evidence that computer scoring is less capable than humans at distinguishing among papers at the extremes
of the writing performance continuum (Goldsmith et al., 2012). Zhang (2013) points out that automated scoring
systems can be “tricked” by test takers who realize that such things as sophistication of vocabulary and greater
length, even if text is incoherent, can lead to higher scores.
In programs combining direct writing assessment with multiple-choice writing components, there is an irony in that
students can earn perhaps ten independent points for their multiple-choice responses, which take five minutes to
provide, whereas they might spend two hours planning, drafting, and revising an essay to present considerably more
evidence of specific skills, yet only receive up to six points for this much greater effort and output. (Double scoring
can produce up to twelve points, but they are far from independent points since the two readers’ scores pertain to
the same essay.) There is even further irony in computer second scoring since the computer is actually evaluating
independent analytic traits, but then combining them statistically in such a way as to produce the best estimate of
the single human holistic score.
To get more independent points for each essay and more useful information for reports, analytic scoring has been
used, whereby readers assign separate scores for typically three to six writing traits or attributes, such as topic
development, organization, support, language, and conventions. Of course, human analytic scoring takes longer
than holistic scoring and yields more discrepant scores that need arbitrating. Furthermore, there has long been
concern that human scorers have difficulty distinguishing among the traits or, more accurately, have difficulty
preventing scores on one trait from influencing scores on another. Thus, human analytic scoring fails at providing
more independent points and more useful information. Analytic scoring, particularly with annotations, is a more
reasonable approach for formative classroom assessment.
The scoring approach being investigated in this study has the humans providing no more than two scores – for
attributes humans can score readily – and has the computer scoring the attributes it can score effectively, but leaving
them as independent scores of different attributes, rather than combining them to produce an estimate of what
a human would assign as a holistic score. This accomplishes several things. It produces more independent score
points per essay, which better reflect the amount of evidence in an essay than a holistic score; and it saves on human
scoring time for second readings. The human and computer scores are like scores on different items in a multi-item
test. (The more items or independent score points, the more reliable the test.) It is recognized that these points a
student earns are not totally independent since they pertain to the same response to a single prompt or task. Clearly,
it is desirable to use a second prompt or even three prompts, but maximizing the independent points for each
student essay also is desirable. Actually, at the same time the computer is generating trait scores that can ultimately
be combined with the human scores to produce a total writing score, it can generate a holistic score as a check on
the human scores. While this holistic score would not be counted since the trait scores that go into it are counted as
independent scores, it can be used to identify human scores that need to be arbitrated.

Design of the Study


The primary research question addressed by this study was: Are two human scores from one reader, combined
with computer analytic scores, better than six human analytic scores with double scoring? “Better” was defined
in terms of discrepancy rates, correlations across tasks (reliability), decision accuracy, decision consistency, and
standard errors at cut points. A secondary research question was: Is a single human holistic score combined with
independent computer analytic scores more reliable than a single human holistic score combined with a computer
holistic score? Different kinds of reliability coefficients, applied to direct writing results, were also investigated.

The Essays Used


The general approach of this study was to score large samples of student essays multiple times in different ways,
yielding results for comparative analyses. It was desirable to gather two different pieces of writing from each student
in response to common prompts so that cross-task correlations could be computed as estimates of alternative
forms reliability. Essays from Grade 8 and 11 students gathered in conjunction with the New England Common
Assessment Program were used. NECAP is a common state accountability assessment program used by the several
New England states that belong to the NECAP consortium. For this program, at Grade 8 each student writes both a
long and a short essay each year in response to two common prompts. At Grade 11, each student produces two long
essays, one in response to a common prompt and one in response to a matrix-sampled prompt. The matrix-sampling
of prompts is used for the piloting of prompts for use as common prompts in later years. So as not to jeopardize
the security of prompts intended for future use, the study used Grade 11 essays from several years ago so that the
matrix-sampled prompt was one that was used more recently as a common prompt and subsequently released to the
public. The four NECAP prompts and their associated scoring rubrics are provided in Appendix A.
NECAP does not score all the essays associated with the matrix-sampled prompts being field tested. Consequently,
the study’s Grade 11 essays were those of only 590 students, many fewer than the number of Grade 8 students whose

essays were used. At Grade 8, to take advantage of the benefits of IRT scaling and linking to the NECAP writing
scale, the work of 1694 students was used. That work included not only the essays of the 8th graders, but also their
responses to non-essay-related test items in writing (multiple-choice, short-answer, and 4-pt constructed-response
items worth a total of 25 raw points.) The linkage to the NECAP scale enabled the use of the NECAP cut scores for
proficiency levels for analysis of decision accuracy and consistency. The scores on the Grade 11 pilot prompts were
not scaled in with the other writing measures for NECAP. However, for generating comparative data from NECAP,
the raw cut scores for proficiency levels used for NECAP were used, along with the NECAP operational rubrics and
procedures.

Human Scoring
Although the essays had already been scored for NECAP, they were rescored by humans using two different
methods for purposes of the study. Double scoring (independent scoring by two readers) was done for each method.
The scoring was accomplished by temporary scoring staff employed for other Measured Progress projects and
trained by regular, full-time scoring staff who followed routine benchmarking and training procedures. These
scoring leaders also monitored scorer accuracy during the scoring process by standard read-behind procedures, and
they arbitrated paired scores from double scoring that differed by more than one point.
One scoring method called for readers to assign one holistic score and five analytic scores to each essay. The
analytic scores were for organization, support, focus, language, and conventions. The second scoring method
required each reader to provide only two scores: a holistic score and a score for support. (Even though holistic
scoring does not focus on a single analytic trait, it was chosen as a basis for one of the human scores because writing
experts have argued that only humans can take into account “intangibles” that can and should influence holistic
scores.) The rubrics designed specifically for this study are presented in Appendix B.
Essays were scored one prompt at a time. The scorers and essays were divided into two groups such that half the
scorers used the first method first, while the other half used the second method first. (Thus, half the essays were
scored by the first method first, while the other half were scored by the second method first.) There were eighteen
experienced scorers involved in the study.

Computer/Automated Scoring
The essays at both grade levels were scored by the ETS e-rater® system, which yielded four analytic scores: word
choice, conventions, fluency/organization, and topical vocabulary (content). The system also produced a holistic
score, which for this study was not based on “training” to produce best estimates of human holistic scores. The
eleventh graders’ essays were also scored by Pearson’s Intelligent Essay Assessor™ (IEA). From this system, the
researchers obtained a holistic score (based on “training”) and three analytic scores: mechanics, content, and style.
The list below summarizes the score data that were available for each student essay:
• MP/Gates human scores – holistic and 5 traits (organization, support, focus, language, conventions), double-scored
• MP/Gates human scores – holistic and 1 trait (support), double-scored
• e-rater® scores – holistic and 4 traits (word choice, conventions, fluency/organization, topical vocabulary/content)
• NECAP human scores – holistic, double-scored
• (Grade 11 only) IEA scores – holistic and 3 traits (mechanics, content, style)

Results of Analyses
Preliminary Analyses
As an initial check of the reliabilities of human and computer scoring, cross-task correlations between holistic scores
were computed for both types of scoring. (The human scores were the sums of two readers’ scores.) These alternative
form reliabilities for human holistic scores were .59 at grade 8 and .78 at grade 11 (from the 6-score method). The
corresponding reliabilities for computer scoring were .82 and .90 respectively. The lower values at grade 8 are not
surprising since one of the essays at that grade was a short one.
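For readers unfamiliar with the computation, a cross-task (alternate-forms) reliability estimate is simply the Pearson correlation between the two scores each student receives on the two prompts. The sketch below uses made-up score pairs, not study data.

    # Cross-task correlation as an alternate-forms reliability estimate.
    from statistics import correlation  # Python 3.10+

    prompt1_scores = [8, 10, 6, 12, 7, 9, 11, 5]  # e.g., sums of two readers' holistic scores
    prompt2_scores = [7, 11, 6, 11, 8, 8, 12, 6]  # same students, second prompt
    print(correlation(prompt1_scores, prompt2_scores))  # Pearson r across tasks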
Tables 1 through 6 address the contention that human scorers tend to not differentiate among analytic traits well.
It should be noted that in these tables, the terms “Scorer 1” and “Scorer 2” actually refer to the first scoring and
the second scoring. There were many more than two scorers working on the project, and scorers were randomly
assigned essays to score. During the course of double scoring, a particular scorer was paired with many different
scorers and was sometimes a first reader, and sometimes a second reader of essays. Scorer 1 scores were used in
analyses in which single scoring was of interest.
The correlations in Tables 1 and 2 are average intercorrelations among the analytic traits. These correlations are
extremely high for human scorers as compared to those for the automated scoring systems. Some researchers have
argued that such high correlations between two measures have two explanations: either one measure is a direct
cause of the other, or the two measures are really measures of the same thing. This logic supports the notion that
the human scorers cannot easily separate the analytic traits in their minds as they are scoring. The correlations
among analytic traits from automated scoring are much more reasonable for variables that should be correlated,
but that are truly independent of one another. Clearly, the factors the computer scoring considers are different from
one another. Even though writing experts do not consider these factors ideal measures of the traits named, they are
good, correlated proxies that are easily “counted” or evaluated by computers. For example, vocabulary level is a basis
for e-rater®’s word choice, and essay length contributes to analytic traits scored by various computer systems.
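The "average intercorrelation" reported in Tables 1 and 2 is taken here to be the mean of the pairwise Pearson correlations among the trait score columns, a standard reading of the label. The toy scores below are illustrative, not study data.

    # Average pairwise intercorrelation among analytic trait scores.
    from itertools import combinations
    from statistics import correlation  # Python 3.10+

    traits = {
        "organization": [3, 4, 2, 4, 3, 1],
        "support":      [3, 4, 2, 3, 3, 1],
        "focus":        [2, 4, 3, 4, 3, 2],
    }
    pairwise = [correlation(traits[a], traits[b]) for a, b in combinations(traits, 2)]
    print(round(sum(pairwise) / len(pairwise), 2))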

TABLE 1: AVERAGE INTERCORRELATIONS OF ANALYTIC TRAITS – GRADE 8

Passage        Scorer      Correlation
Lightning      Scorer 1    .937
               Scorer 2    .938
               e-rater®    .290*
School Lunch   Scorer 1    .926
               Scorer 2    .924
               e-rater®    .615

* The e-rater® Lightning intercorrelations being lower than the e-rater® School Lunch intercorrelations is to be expected because the Lightning essays are much shorter essays.

TABLE 2: AVERAGE INTERCORRELATIONS OF ANALYTIC TRAITS* – GRADE 11

Passage      Scorer      Correlation
Ice Age      Scorer 1    .947
             Scorer 2    .946
             e-rater®    .614
             IEA         .753
Reflection   Scorer 1    .916
             Scorer 2    .918
             e-rater®    .535
             IEA         .648

* Human scorers scored five analytic traits, while e-rater® and the IEA system scored four and three respectively.

It was hypothesized that if human scorers had fewer traits to score, then they could more easily separate traits in
their minds. Tables 3 and 4 suggest that this is not the case. They show the correlations between holistic scores and
support scores obtained via the two human scoring methods. While we would expect these correlations to be high
because support obviously contributes to holistic scores, the hypothesis mentioned above would lead one to expect
a lower correlation by the 2-score method. Such was not obtained.

TABLE 3: CORRELATIONS BETWEEN HUMAN HOLISTIC AND SUPPORT SCORES – GRADE 8

Passage        Scorer      6-Score Method   2-Score Method
Lightning      Scorer 1    .907             .905
               Scorer 2    .906             .903
School Lunch   Scorer 1    .921             .924
               Scorer 2    .924             .926

TABLE 4: CORRELATIONS BETWEEN HOLISTIC AND SUPPORT SCORES – GRADE 11

Passage      Scorer      6-Score Method   2-Score Method
Ice Age      Scorer 1    .946             .945
             Scorer 2    .937             .941
Reflection   Scorer 1    .930             .946
             Scorer 2    .933             .930

Tables 5 and 6 show all the correlations between first and second human scorings. While these are high as expected,
they are consistently lower than the intercorrelations among different trait scores from single scoring. In other
words, correlations between scores from different scorers on the same traits are not as high as correlations between
scores from the same scorers on different traits (refer to Tables 1 and 2). All of the analyses in these six tables lend
credence to the concern about the lack of independence of human analytic scores. Statistical tests of the significance
of differences among these correlations from “dependent populations” were not computed as the distribution of
such differences is not known, but the statistics speak for themselves.

TABLE 5: CORRELATIONS BETWEEN SCORES FROM SCORER 1 AND SCORER 2 – GRADE 8

Passage Holistic Organization Support Focus Language Conventions


Lightning .844 .807 .813 .800 .810 .810
School Lunch .905 .846 .840 .853 .857 .862

TABLE 6: CORRELATIONS BETWEEN SCORES FROM SCORER 1 AND SCORER 2 – GRADE 11

Passage Holistic Organization Support Focus Language Conventions


Ice Age .907 .870 .866 .871 .892 .899
Reflection .907 .881 .882 .865 .875 .862

Reliability Analyses
Estimating the reliability of direct writing assessments can be challenging, largely because the assessments are
often one-item tests – single writing prompts. NECAP has an advantage in that two prompts are administered
to each student in that program. This allows the computation of correlations between scores on two essays – an
alternative form reliability estimate for either of the two measures. For this study, these are reported in the first two
columns of Table 7. Interestingly, if the two essays were written in conjunction with different programs, one might
consider their correlation to be validity evidence. (Validity of scoring could be demonstrated by the correlation
between results of scoring the essays in response to the same prompt by two different scoring methods.)
As can be concluded from the Grade 8 results, the short essay, being a weaker measure of writing, resulted in a
lower cross-task correlation. Of course, the NECAP assessments are two-prompt tests, so the reliability of the total
tests could be estimated by applying the Spearman-Brown Prophecy Formula to .63 and .83, thereby estimating the
reliabilities of tests twice as long. However, that is not what was done to obtain the values in the Spearman-Brown
columns of Table 7. The alpha coefficients and Spearman-Brown estimates treated the score components as separate
items in multi-item tests. Reliability estimates by these two methods were computed separately for each prompt,
then the two estimates at a grade level were averaged. Thus, all the reliability estimates in Table 7 pertain to a one-
prompt test. Alpha coefficients and Spearman-Brown estimates are not shown for rows a through d because of the
lack of independence of the component scores which inflates internal consistency measures. The alpha coefficient
for the 6-score method at Grade 11, for example, was 0.99 (not shown).
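For clarity, the two internal-consistency estimates in Table 7 treat each score component (e.g., the human holistic score and each computer trait score) as an "item." The sketch below shows one common way such coefficients are computed; the exact variant used in the study is not documented here, so treat the Spearman-Brown version (stepped up from the average inter-item correlation) as an assumption.

    # Coefficient alpha and a standardized Spearman-Brown estimate, treating
    # each score component as an item. 'columns' holds one list per component,
    # with one entry per student (toy data).
    from itertools import combinations
    from statistics import correlation, pvariance

    def coefficient_alpha(columns):
        k = len(columns)
        totals = [sum(scores) for scores in zip(*columns)]
        return k / (k - 1) * (1 - sum(pvariance(c) for c in columns) / pvariance(totals))

    def spearman_brown_standardized(columns):
        k = len(columns)
        pairs = list(combinations(columns, 2))
        r_bar = sum(correlation(a, b) for a, b in pairs) / len(pairs)
        return k * r_bar / (1 + (k - 1) * r_bar)

    columns = [[4, 3, 5, 2, 4], [3, 3, 4, 2, 5], [4, 2, 5, 3, 4]]
    print(round(coefficient_alpha(columns), 2), round(spearman_brown_standardized(columns), 2))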

TABLE 7: RELIABILITY ESTIMATES

                                                                      Cross-Task         Alpha              Spearman-Brown
Score Combination                                                     Correlations       Coefficients       Estimates
                                                                      Gr. 8   Gr. 11     Gr. 8   Gr. 11     Gr. 8   Gr. 11
a. Human holistic + 5 trait scores from the 6-score method with
   double scoring (12 scores summed for cross-task corrs.)            .63     .83        –       –          –       –
b. Human holistic + human support score from the 2-score method,
   single scored (i.e., Scorer 1 only) + 4 e-rater® trait scores      .76     .89        –       –          –       –
c. Human holistic + human support score from the 2-score method,
   single scored (i.e., Scorer 1 only) + 3 Pearson IEA trait scores
   (not holistic), Gr. 11 only                                        –       .88        –       –          –       –
d. Human holistic score from 2-score method, double scored            .64     .82        –       –          –       –
e. Human holistic score from 2-score method, single scored
   (i.e., Scorer 1 only) + 4 e-rater® trait scores                    .78     .89        .89     .90        .83     .89
f. Human holistic score from 2-score method, single scored
   (i.e., Scorer 1 only) + 3 Pearson IEA trait scores, Gr. 11 only    –       .87        –       .88        –       .91
g. Human holistic score from 2-score method, single scored
   (i.e., Scorer 1 only) + e-rater® holistic (untrained)              –       .87        –       .92        –       .91

Consistent with data reported earlier, one of the Grade 8 essays being very short resulted in reliability coefficients
at that grade level being lower than those at Grade 11. Rows a, b, and c in Table 7 address the primary research
question. They suggest that, as hypothesized, the combined human and computer scoring approach proposed may
indeed yield somewhat higher reliability coefficients than pure human analytic scoring. Row d shows the cross-
task correlations for human holistic double scoring, a common approach in large-scale direct writing assessment.
The correlations are almost identical to those in Row a pertaining to human analytic scoring. This finding further
supports the notion that the human analytic scores are far from independent – i.e., not really reflecting different
attributes. The score combinations represented in rows e and f are the same as those in rows b and c, except that
the human-generated support scores were excluded. With all independent measures in these score combinations,
internal consistency coefficients (alpha coefficients and Spearman-Brown estimates) were computed. These were
very consistent with the cross-task correlations reported in rows b and c, again suggesting that nothing is gained by
the inclusion of the human analytic attribute score. Row g, reflecting combined human and computer holistic scores,
shows similar results to those in rows b, c, e, and f. Generally, the internal consistency measures, when appropriate to
compute, are consistent with the reliabilities via cross-task correlations.

Decision Accuracy/Consistency and Standard Errors of Measurement


With decisions being made about student performance based on cut scores on tests (pass/fail, proficient/not
proficient, etc.), measurement quality is often evaluated in terms of decision accuracy and consistency, as well as
standard errors of measurement at cut-points. Consequently, in addition to reliability analyses, this study looked at
these other quality indicators, which required putting the essay score data on the NECAP scale so that NECAP cut-
scores could be used.
Decision accuracy and consistency. Decision accuracy is an estimate of the proportion of categorization decisions
that would match decisions that would result if scores contained no measurement error. Decision consistency is an

estimate of the proportion of categorization decisions that would match decisions based on scores from a parallel
form. Tables 8 and 9 show the decision accuracy and consistency statistics for various combinations of Grade 8 and
Grade 11 writing score data. The Grade 8 combinations included the NECAP non-essay scores since they were used
in the NECAP scaling. What was varied in the different score combinations were the human and computer scoring
components of interest. The first two data rows in Table 8 and the first three data rows in Table 9 pertain to the two
models of most interest in this study: 6 human-generated scores versus 2 human-generated scores augmented by
automated analytic scores. While it might appear that differences in those two statistics (relative to the proficient
cut) favor the 6-score method, the actual differences are practically negligible. The last data row in each table shows
results for a score combination in which the human scoring contribution is only a holistic score. As with previous
analyses, the results suggest that human scores on a second trait do not make a difference.
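The two indices can be illustrated with a simple simulation under classical test theory assumptions. The numbers and the normal-error model below are illustrative only; the study's own estimates were computed on the NECAP scale with its own procedures, not by this simulation.

    # Decision accuracy: agreement between classifications from observed scores
    # and from error-free (true) scores. Decision consistency: agreement between
    # classifications from two parallel forms.
    import random

    random.seed(0)
    CUT = 50.0
    classify = lambda score: score >= CUT

    true_scores = [random.gauss(50, 10) for _ in range(10000)]
    form_a = [t + random.gauss(0, 4) for t in true_scores]  # observed form
    form_b = [t + random.gauss(0, 4) for t in true_scores]  # parallel form

    n = len(true_scores)
    accuracy = sum(classify(a) == classify(t) for a, t in zip(form_a, true_scores)) / n
    consistency = sum(classify(a) == classify(b) for a, b in zip(form_a, form_b)) / n
    print(round(accuracy, 2), round(consistency, 2))  # accuracy is typically the larger of the two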

TABLE 8: DECISION ACCURACY (AND CONSISTENCY) – GRADE 8


Score Combination                                                     Overall        Near Proficient vs Proficient
MP/Gates human holistic + 5 traits (double scoring both long and
  short essays) + NECAP non-essay scores                              0.92 (0.88)    0.95 (0.93)
MP/Gates human holistic + 1 trait (single scoring both essays)
  + e-rater® 4 traits (both essays) + NECAP non-essay scores          0.79 (0.71)    0.92 (0.88)
NECAP holistic (single scoring both essays) + e-rater® 4 traits
  (both essays) + NECAP non-essay scores                              0.77 (0.68)    0.92 (0.89)

Note: Decision consistency appears in parentheses.

TABLE 9: DECISION ACCURACY (AND CONSISTENCY) – GRADE 11


Score Combination                                                     Overall        Near Proficient vs Proficient
MP/Gates human holistic + 5 traits (double scoring both essays)       0.91 (0.87)    0.97 (0.95)
MP/Gates human holistic + 1 trait (single scoring both essays)
  + e-rater® 4 traits (both essays)                                   0.79 (0.71)    0.93 (0.90)
MP/Gates human holistic + 1 trait (single scoring both essays)
  + Pearson IEA automated 3 traits (both essays)                      0.87 (0.81)    0.96 (0.94)
NECAP holistic (single scoring both essays) + e-rater® 4 traits
  (both essays)                                                       0.76 (0.68)    0.92 (0.88)
NECAP holistic (single scoring both essays) + Pearson IEA 3 traits
  (both essays)                                                       0.84 (0.78)    0.95 (0.93)

Note: Decision consistency appears in parentheses.

Standard errors at cut-points. The same combinations of scores corresponding to the rows in Tables 8 and 9 are
used in Tables 10 and 11 below. The latter tables report the standard errors of measurement at the cut points, the
second cut being the one that separates proficient performance from not-proficient performance.
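As the column header indicates, the parenthesized values in Tables 10 and 11 express each standard error as a percentage of the 80-point NECAP scale range; for example:

    # Standard error expressed as a percentage of the 80-point scale range.
    sem = 1.01                       # scale-score standard error from Table 10
    print(round(sem / 80 * 100, 2))  # 1.26, matching the 1.01 (1.26) entry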

TABLE 10: STANDARD ERRORS AT CUTS* – GRADE 8


                                                                      Standard Error (% of Range)
Score Combination                                                     C1             C2             C3
MP/Gates human holistic + 5 traits (double scoring both long and
  short essays) + NECAP non-essay scores                              .87 (1.09)     1.01 (1.26)    1.40 (1.75)
MP/Gates human holistic + 1 trait (single scoring both essays) +
  e-rater® 4 traits (both essays) + NECAP non-essay scores            1.94 (2.41)    2.07 (2.59)    2.51 (3.14)
NECAP holistic (single scoring both essays) + e-rater® 4 traits
  (both essays) + NECAP non-essay scores                              1.69 (2.11)    2.66 (3.32)    3.00 (3.75)
* All standard errors are on the scale score metric. NECAP uses an 80-point scale.
TABLE 11: STANDARD ERRORS AT CUTS* – GRADE 11
                                                                      Standard Error (% of Range)
Score Combination                                                     C1             C2             C3
MP/Gates human holistic + 5 traits (double scoring both essays)       1.09 (1.36)    1.10 (1.38)    1.24 (1.56)
MP/Gates human holistic + 1 trait (single scoring both essays) +
  e-rater® 4 traits (both essays)                                     1.59 (1.99)    1.64 (2.05)    1.90 (2.38)
MP/Gates human holistic + 1 trait (single scoring both essays) +
  Pearson IEA automated 3 traits (both essays)                        1.55 (1.94)    1.58 (1.98)    1.74 (2.18)
NECAP holistic (single scoring both essays) + e-rater® 4 traits
  (both essays)                                                       2.04 (2.55)    2.06 (2.58)    2.20 (2.75)
NECAP holistic (single scoring both essays) + Pearson IEA 3 traits
  (both essays)                                                       1.97 (2.46)    1.93 (2.41)    1.96 (2.45)
*All standard errors are on the scale score metric. NECAP uses an 80-point scale.

The results depicted in Table 10 and Table 11 suggest that the primary scoring methods of interest do make a
difference, with lower standard errors associated with 6 human-generated scores as opposed to 2 human-generated
scores (or a single human holistic score) augmented by automated analytic scores. Again, however, the differences
(and the standard errors themselves) are small.
The finding of slightly better results for the 6-score method (although the actual differences in decision accuracy/
consistency and standard errors were not great) was not expected, particularly in light of the reliability analyses.
However, it is unclear just what the impact is of the spurious, extremely high intercorrelations among human
analytic scores on these statistics for the 6-human-score method. (This situation approaches that of having a one-
item test, but counting that item many times as if each time it is a different item, which would inappropriately inflate
internal consistency measures and deflate standard errors. This is why internal consistency coefficients were not
reported for the 6-score method in Table 7.) Investigating the impact of greatly inflated internal consistency on
decision accuracy/consistency and standard errors was beyond the scope of this study.

Scorer Discrepancies (>1)


Typically, when two scorers award scores that differ by more than one point, a third reading is required to arbitrate
the discrepancy and determine final scores of record. The analyses leading to the results reported in Tables 12 and 13
used original, unarbitrated scores. Looking at frequencies of discrepancies is another way of evaluating agreement
rates. It was hypothesized that human agreement rates would be greater (and discrepancy rates lower) if there
were fewer analytic traits to score. The data in Table 12 pertain to just the holistic and support scores awarded
using both the 6-trait and 2-trait methods – the two scores common to both methods. Indeed the discrepancy rates
(rates of score differences greater than one point) between scorers for these two scores appeared to be greater when
scorers had to award six analytic scores than when they awarded only the two scores.
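The discrepancy rates in Tables 12 and 13 count score pairs on the same essay that differ by more than one point; a minimal sketch with made-up score pairs follows.

    # Count pairs of scores on the same essay that differ by more than one point.
    score_pairs = [(4, 5), (3, 3), (6, 4), (2, 4), (5, 5), (1, 3)]
    discrepant = sum(abs(a - b) > 1 for a, b in score_pairs)
    print(discrepant, round(100 * discrepant / len(score_pairs), 2))  # 3 50.0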

TABLE 12: HUMAN SCORING DISCREPANCIES BY METHOD


                                Holistic and Support Scores       Holistic and Support Scores
                                from 6-Score Method               from 2-Score Method
Grade   Number of Essays        # Discrep.     % Discrep.         # Discrep.     % Discrep.
8       3376                    303            8.98               221            6.55
11      1168                    74             6.34               55             4.71

Since the Pearson IEA system produced holistic scores after “training” of the computer to mimic human scores, a
common practice, the study could compare human-to-human with human-to-computer discrepancy rates. Table
13 shows that the discrepancies between human holistic scores and computer holistic scores (2.08% on average)
are fewer than those between two human scorers (3.21% on average). However, that difference is not large enough
to make a big difference in the expense of third readings for arbitrations, even recognizing that there would likely
be three times as many human discrepancies when scoring six traits rather than two traits. What would make
a meaningful monetary difference for large projects, however, is not the time to remedy discrepant scores, but
rather the time it takes humans to score six traits rather than two – roughly 25 percent more time. (This is a rough
approximation based on review of scorer time sheet information recorded during this study.)

TABLE 13: FREQUENCY OF HOLISTIC SCORE DISCREPANCIES >1 – GRADE 11


Human Discrepancies > 1 (S1 vs S2)                 Human (S1) vs IEA Discrepancies
# Scorings     # Discrep.     % Discrep.           # Scorings     # Discrep.     % Discrep.
2336*          75             3.21                 2066           43             2.08

*For each student, there were two essays double scored by humans by two methods.

Findings, Conclusions, and Recommendations
The primary purpose of this study was to investigate a method of scoring student writing that:
• addressed the problem of the lack of independence of human-generated analytic scores;
• produced more independent score points per essay in order to enhance measurement quality;
• engaged humans in scoring that they could do better than computers and engaged computers in scoring that they could do better than humans; and
• capitalized on the efficiencies of computer scoring of essays.
That method, as implemented in this study, required (1) human scorers to assign two scores to each student essay:
one a holistic score, and the second a score for the analytic trait typically called “support” (single reader scoring)
and (2) an automated scoring system to generate multiple trait scores. The primary comparison method was all-
human double scoring that yielded a holistic score and five trait scores. Other combinations of writing scores were
examined to address other aspects of the scoring of writing relevant to the primary focus.
The key findings of analyses are summarized below:
1. Intercorrelations among human-generated and among computer-generated analytic writing scores strongly
supported the long-held concern that human scorers do not differentiate analytic traits effectively. There was an
extreme “halo effect.”
2. The combined human-computer scoring approach produced somewhat more reliable scores than the all-human
holistic plus analytic (double) scoring.
3. The combined human-computer approach including a human holistic score only was just as reliable as the
approach which included both human holistic and support scores.
4. The reliability of combined human holistic and human analytic double scoring was no greater than the
reliability of human holistic double scoring by itself.
5. With human holistic and computer analytic scores being clearly independent measures, internal consistency
measures of reliability were justified, and they were consistent with reliability estimates based on cross-task
correlations.
6. Small differences between the two primary methods in decision accuracy and consistency and standard errors
at the proficiency cut-points favored the human analytic scoring approach. It is unclear what impact the inflated
internal consistency of human analytic scores has on these three statistics.
7. As anticipated, when humans had only two scores to award, the scoring discrepancy rate was lower than the
discrepancy rate for those same two scores when scorers had to award six scores.
8. Discrepancy rates between human and computer-generated scores were lower than discrepancy rates between
two human scorers.
9. Although not studied precisely, perusal of timesheet data suggested that 6-score human scoring took
approximately 25 percent more time than the 2-score human scoring method.
Generally, the findings of this study support combined human holistic and computer analytic scoring, particularly
for programs requiring large scoring projects. There are writing traits that computers cannot score and unusual
features writers might create that computers cannot take into account. Clearly, however, computer analytic scoring
can provide more independent score points than human-generated trait scores, thereby enhancing reliability and
at the same time offering time- and cost-efficiencies. Single scoring of student essays by the human readers is
sufficient, if computer-generated holistic scores or a simple combination of the computer-generated analytic scores
is used to identify human scores that should be arbitrated. Of course, if there are high stakes for individual students
associated with the writing test results, more than one task/prompt is advisable.

Because the study design was partially dictated by large-scale assessment procedures and because of the particular
statistics computed, statistical hypothesis tests were not performed. However, since combined single human
holistic and computer holistic scoring is already accepted practice, the results of this study would easily justify a
state testing program employing, on a trial basis, the recommended approach of single human holistic scoring with
computer analytic scoring alongside the human/computer holistic approach that requires "training" of the
computer. If the results of the two are similar and the reliability of the "new" approach is the same as or better than
that of the other, then future assessments could employ the "new" approach without the need to "train" the computer,
thus saving time and expense.
The study results are not a justification for abandoning human analytic scoring altogether. For classroom
formative and summative assessment and for district testing for which there is adequate time for scorers to provide
annotations along with scores, analytic scoring can be especially useful. It focuses teachers/readers on analytic
traits and can help them provide rich, meaningful feedback on how students can improve their essays. For example,
pointing out to a student specifically where clearer wording or a particular example might have enhanced an
argument is far more helpful than a simple numerical score on a trait. Computer-generated trait scores contribute
to overall test reliability, and can also provide some useful diagnostic information. However, educators should be
mindful that the computer uses proxy measures of the attributes the writing experts value, and can be “outsmarted”
the more knowledgeable the test takers become regarding those proxy measures.
Direct writing assessment has been the most accepted form of performance assessment implemented in our
schools. Yet even with our many years of experience, we have managed to shortchange basic measurement
principles in practicing it. It is not unusual for a single writing assignment to be a component in a broader
assessment of English language arts. Yet, treating that assignment as a one-item test, many programs offer scorer
agreement rates as the only evidence of technical quality for their writing component. This is NOT test reliability.
There is a lot of evidence of many aspects of effective writing in student essays, and independent computer-
generated trait scores, combined with human holistic scores, can help tap that evidence and make a total writing test
score range (and reliability estimates) more reflective of that large amount of evidence. The measurement quality of
direct writing assessments is as much a function of how we score the student work as it is a function of the quality of
the tasks.

References
Dunbar, S., Koretz, D., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4(4), 289–303.
Goldsmith, J., Davis, E., Kahl, S., & DeVito, P. (2012). Can a Machine Cry? Current Research on Using Software to Grade Complex Essays. Presentation delivered at the National Conference on Student Assessment, CCSSO, Minneapolis, June 28.
Pecheone, R., Kahl, S., Hamma, J., & Jaquith, A. (2010). Through a Looking Glass: Lessons Learned and Future Directions for Performance Assessment. Stanford, CA: Stanford University, Stanford Center for Opportunity Policy in Education.
Zhang, M. (2013). Contrasting automated and human scoring of essays. R&D Connections, No. 21. Princeton, NJ: ETS.

Appendix A: NECAP Prompts and Rubrics

Grade 8 NECAP Prompt—Lightning
For a class report, a student wrote this fact sheet about lightning. Read the fact sheet and then write a response to the
prompt that follows.

Lightning
Facts about lightning
• thunder is made from the sound waves produced by lightning
• lightning causes air to heat rapidly and then cool, producing sound waves
• lightning happens mostly within clouds, but it can also happen between a cloud and Earth
• causes more than 10,000 forest fires in the United States each year
• Earth is struck by lightning 50 to 100 times each second worldwide
• can strike up to 20 miles away from a storm
• temperature of a lightning bolt hotter than the Sun
• about 100,000 thunderstorms in the United States each year
• chances of being struck by lightning 1 in 600,000
• average flash of lightning could turn on a 100-watt bulb for three months
• can strike more than once in the same place
• “blitz” is the German word for lightning
• moves 60,000 miles a second
• kills or injures several hundred people in the United States each year

Write your response to prompt 13 on page 23 in your Student Answer Booklet.


13. Use the fact sheet to write an introductory paragraph for a report about the dangers of lightning.
Your paragraph should
• contain a lead sentence/hook that will interest the reader in the report,
• set the context for the report, and
• include a clear focus/controlling idea.
Select only the facts you need for your introduction.

Grade 8 NECAP Scoring Rubric for Lightning

Scoring Guide:
Score 4: Response provides an introduction to a report about the dangers of lightning. The paragraph contains an appropriate and effective lead sentence, clearly sets the context for the report, and contains a clearly stated focus/controlling idea. The paragraph includes only relevant facts from the fact sheet. The response is well-organized. The response includes a variety of correct sentence structures and demonstrates sustained control of grade-appropriate grammar, usage, and mechanics.
Score 3: Response provides an introduction to a report about the dangers of lightning. There is a lead sentence, although it may serve more to introduce the topic than to capture the reader’s interest. The paragraph sets the context and contains a focus/controlling idea, but there may be minor lapses in focus or clarity. The paragraph includes mostly relevant facts from the fact sheet. The response is generally well-organized. The response includes some sentence variety and demonstrates general control of grade-appropriate grammar, usage, and mechanics.
Score 2: Response is an attempt at a paragraph that is an introduction to a report on the dangers of lightning. The paragraph may have no lead sentence, may not clearly set the context, or may lack a consistent focus or clear organization. The paragraph includes some relevant facts from the fact sheet. The response includes some attempt at sentence variety and may demonstrate inconsistent control of grammar, usage, and mechanics.
Score 1: Response is undeveloped or contains an unclear focus. There is little evidence of logical organization.
Score 0: Response is totally incorrect or irrelevant.
Blank: No response

Grade 8 NECAP Prompt—School Lunch
Write your response to prompt 17 on pages 24 through 26 in your Student Answer Booklet.

When writing a response to prompt 17, remember to
• read the prompt carefully,
• develop a complete response to the prompt,
• proofread and edit your writing, and
• write only in the space provided.

17. Do you think that making your school lunch period longer is a good idea? Write to your principal to persuade
him or her to agree with your point of view.

Grade 8 NECAP Scoring Rubric for School Lunch

Scoring Guide:
Score 6
• purpose/position is clear throughout; strong focus/position OR strongly stated purpose/opinion focuses the writing
• intentionally organized for effect
• fully developed arguments and reasons; rich, insightful elaboration supports purpose/opinion
• distinctive voice, tone, and style effectively support position
• consistent application of the rules of grade-level grammar, usage, and mechanics
Score 5
• purpose/position is clear; stated focus/opinion is maintained consistently throughout
• well organized and coherent throughout
• arguments/reasons are relevant and support purpose/opinion; arguments/reasons are sufficiently elaborated
• strong command of sentence structure; uses language to support position
• consistent application of the rules of grade-level grammar, usage, and mechanics
Score 4
• purpose/position and focus are evident but may not be maintained
• generally well organized and coherent
• arguments are appropriate and mostly support purpose/opinion
• well-constructed sentences; uses language well
• may have some errors in grammar, usage, and mechanics
Score 3
• purpose/position may be general
• some sense of organization; may have lapses in coherence
• some relevant details support purpose; arguments are thinly developed
• generally correct sentence structure; uses language adequately
• may have some errors in grammar, usage, and mechanics
Score 2
• attempted or vague purpose/position
• attempted organization; lapses in coherence
• generalized, listed, or undeveloped details/reasons
• may lack sentence control or may use language poorly
• may have errors in grammar, usage, and mechanics that interfere with meaning
Score 1
• minimal evidence of purpose/position
• little or no organization
• random or minimal details
• rudimentary or deficient use of language
• may have errors in grammar, usage, and mechanics that interfere with meaning
Score 0: Response is totally incorrect or irrelevant.
Blank: No response

Grade 11 NECAP Prompt—Ice Age
Everyday Life at the End of the Last Ice Age
Informational Writing (Report)
A student wrote this fact sheet about life 12,000 years ago, at the end of the last ice age. Read the fact sheet. Then
write a response to the prompt that follows.

Everyday Life at the End of the Last Ice Age
• people lived in bands of about 25 members
• lived mainly by hunting and gathering
• shared decision-making fairly equally among members in a band
• each person skilled in every type of job
• diet: small and large mammals, fish, shellfish, fruits, wild greens and vegetables, grains, roots, and nuts
• approximately 10,000 years ago wooly mammoth became extinct
• nomadic based on time of year or movement of animal herds
• cooked meat by roasting it on a spit over a fire or by boiling it inside a piece of leather secured by a twig
• gathered herbs
• made everything themselves: tools, homes, clothing, medicines, etc.
• worked about 2-3 hours a day getting food
• worked about 2-3 hours a day making and repairing tools and clothes
• spent remainder of day relaxing with family and friends
• told stories, danced, sang, and played games
• owned very few possessions
• no concept of rich or poor
• communicated through art (painting and sculpture) and the spoken word
• buried their dead and had concepts of religion and an afterlife
• sometimes adorned themselves with ornaments and decorations such as jewelry, tattoos, body painting, and elaborate hairstyles

1. What would a person from 12,000 years ago find familiar and/or different about life today? Select relevant
information from the fact sheet and use your own knowledge to write a report.

Before writing, consider
• the focus/thesis of your report
• the supporting details in your report
• the significance of the information in your report

A complete response to the prompt will include


✔ a clear purpose/focus
✔ coherent organization
✔ details/elaboration
✔ well-chosen language and a variety of sentence structures
✔ control of conventions

Grade 11 NECAP Scoring Rubric for Ice Age

Scoring Guide:
Score 6
• purpose is clear throughout; strong focus/controlling idea OR strongly stated purpose focuses the writing
• intentionally organized for effect
• fully developed details, rich and/or insightful elaboration supports purpose
• distinctive voice, tone, and style enhance meaning
• consistent application of the rules of grade-level grammar, usage, and mechanics
Score 5
• purpose is clear; focus/controlling idea is maintained throughout
• well organized and coherent throughout
• details are relevant and support purpose; details are sufficiently elaborated
• strong command of sentence structure; uses language to enhance meaning
• consistent application of the rules of grade-level grammar, usage, and mechanics
Score 4
• purpose is evident; focus/controlling idea may not be maintained
• generally organized and coherent
• details are relevant and mostly support purpose
• well-constructed sentences; uses language well
• may have some errors in grammar, usage, and mechanics
Score 3
• writing has a general purpose
• some sense of organization; may have lapses in coherence
• some relevant details support purpose
• uses language adequately; may show little variety of sentence structures
• may have some errors in grammar, usage, and mechanics
Score 2
• attempted or vague purpose
• attempted organization; lapses in coherence
• generalized, listed, or undeveloped details
• may lack sentence control or may use language poorly
• may have errors in grammar, usage, and mechanics that interfere with meaning
Score 1
• minimal evidence of purpose
• little or no organization
• random or minimal details
• rudimentary or deficient use of language
• may have errors in grammar, usage, and mechanics that interfere with meaning
Score 0: Response is totally incorrect or irrelevant.
Blank: No response

Grade 11 NECAP Prompt—Reflective Essay
Reflective Essay
Read this quotation. Think about what it means and how it applies to your life.

“The cure for boredom is curiosity. There is no cure for curiosity.”
—Dorothy Parker

Write your response to prompt 1 on pages 3 through 5 in your Student Answer Booklet.

1. What does this quotation mean to you? Write a reflective essay using personal experience or observations to
show how the quotation applies to your life.

Before writing, consider
• what the quotation means to you
• what experience/observations support your ideas
• how your ideas connect to the larger world

A complete response to the prompt will include
✔ a clear purpose/focus
✔ coherent organization
✔ details/elaboration
✔ well-chosen language and a variety of sentence structures
✔ control of conventions

Grade 11 NECAP Scoring Rubric for Reflective Essay

Scoring Guide:
Score 6
• purpose is clear throughout; strong focus/controlling idea OR strongly stated purpose focuses the writing
• intentionally organized for effect
• fully developed details, rich and/or insightful elaboration supports purpose
• distinctive voice, tone, and style enhance meaning
• consistent application of the rules of grade-level grammar, usage, and mechanics
Score 5
• purpose is clear; focus/controlling idea is maintained throughout
• well organized and coherent throughout
• details are relevant and support purpose; details are sufficiently elaborated
• strong command of sentence structure; uses language to enhance meaning
• consistent application of the rules of grade-level grammar, usage, and mechanics
Score 4
• purpose is evident; focus/controlling idea may not be maintained
• generally organized and coherent
• details are relevant and mostly support purpose
• well-constructed sentences; uses language well
• may show inconsistent control of grade-level grammar, usage, and mechanics
Score 3
• writing has a general purpose
• some sense of organization; may have lapses in coherence
• some relevant details support purpose
• uses language adequately; may show little variety of sentence structures
• may contain some serious errors in grammar, usage, and mechanics
Score 2
• attempted or vague purpose; stays on topic
• little evidence of organization; lapses in coherence
• generalizes or lists details
• lacks sentence control; uses language poorly
• errors in grammar, usage, and mechanics are distracting
Score 1
• lack of evident purpose; topic may not be clear
• incoherent or underdeveloped organization
• random information
• rudimentary or deficient use of language
• serious and persistent errors in grammar, usage, and mechanics throughout
Score 0: Response is totally incorrect or irrelevant.
Blank: No response

Appendix B: Scoring Study Rubrics

Scoring Rubric — Holistic and Support
Holistic Rating (Overall Effectiveness of Explanation of Argument in Accomplishing Purpose):
1 – Not effective at all
2 – Limited effectiveness
3 – Inconsistently effective
4 – Generally effective
5 – Highly effective
6 – Highly effective with distinctive qualities

Analytic Ratings:

1 – Far from Meeting Expectations
• Organization: Overall structure lacking, ideas disorganized, few or no transitions.
• Support: Ideas/claims/position are insufficiently or poorly supported with little, ambiguous, or no details/evidence.
• Focus/Coherence: Lacks focus, often off topic.
• Language/Tone/Style: Language/vocabulary and tone inconsistent or generally inappropriate for audience and purpose; little or no variation in sentence structure.
• Conventions: Many errors (major and minor) in grammar, usage, and mechanics, frequently interfering with meaning.

2 – Approaching Expectations
• Organization: Inconsistent organization and sequencing of ideas, some transitions, but some noticeably lacking.
• Support: Ideas/claims/position are unevenly supported with some details/evidence.
• Focus/Coherence: Inconsistent or uneven focus, some straying from topic.
• Language/Tone/Style: Language/vocabulary and tone somewhat inconsistent for audience and purpose; some variation in sentence structure.
• Conventions: Many errors in grammar, usage, and mechanics, some interfering with meaning.

3 – Meets Expectations
• Organization: Adequate organization, mostly logical ordering of ideas, sufficient transitions.
• Support: Ideas/claims/position are supported with adequate relevant details/evidence. In argumentation, identifies counterclaim(s).
• Focus/Coherence: Clear focus throughout most of essay.
• Language/Tone/Style: Language/vocabulary and tone mostly appropriate for audience and purpose; adequate variation in sentence structure.
• Conventions: Some errors in grammar, usage, and mechanics, but few, if any, interfering with meaning.

4 – Exceeds Expectations
• Organization: Clear, overall organization structure (discourse units), logical sequencing of ideas, effective transitions.
• Support: Ideas/claims/position are fully supported with convincing, relevant, clear details/evidence. In argumentation, clearly distinguishes claims and counterclaims, refuting the latter.
• Focus/Coherence: Strong focus throughout essay.
• Language/Tone/Style: Language/vocabulary and tone consistently appropriate for audience and purpose; effective variation in sentence structure.
• Conventions: Few, if any, errors in grammar, usage, and mechanics, and none interfering with meaning.
Scoring Rubric — Holistic and Five Analytic Traits
Holistic Rating (Overall Effectiveness of Explanation of Argument in Accomplishing Purpose):
1 – Not effective at all
2 – Limited effectiveness
3 – Inconsistently effective
4 – Generally effective
5 – Highly effective
6 – Highly effective with distinctive qualities

Analytic Ratings:

1 – Far from Meeting Expectations
• Organization: Overall structure lacking, ideas disorganized, few or no transitions.
• Support: Ideas/claims/position are insufficiently or poorly supported with little, ambiguous, or no details/evidence.
• Focus/Coherence: Lacks focus, often off topic.
• Language/Tone/Style: Language/vocabulary and tone inconsistent or generally inappropriate for audience and purpose; little or no variation in sentence structure.
• Conventions: Many errors (major and minor) in grammar, usage, and mechanics, frequently interfering with meaning.

2 – Approaching Expectations
• Organization: Inconsistent organization and sequencing of ideas, some transitions, but some noticeably lacking.
• Support: Ideas/claims/position are unevenly supported with some details/evidence.
• Focus/Coherence: Inconsistent or uneven focus, some straying from topic.
• Language/Tone/Style: Language/vocabulary and tone somewhat inconsistent for audience and purpose; some variation in sentence structure.
• Conventions: Many errors in grammar, usage, and mechanics, some interfering with meaning.

3 – Meets Expectations
• Organization: Adequate organization, mostly logical ordering of ideas, sufficient transitions.
• Support: Ideas/claims/position are supported with adequate relevant details/evidence. In argumentation, identifies counterclaim(s).
• Focus/Coherence: Clear focus throughout most of essay.
• Language/Tone/Style: Language/vocabulary and tone mostly appropriate for audience and purpose; adequate variation in sentence structure.
• Conventions: Some errors in grammar, usage, and mechanics, but few, if any, interfering with meaning.

4 – Exceeds Expectations
• Organization: Clear, overall organization structure (discourse units), logical sequencing of ideas, effective transitions.
• Support: Ideas/claims/position are fully supported with convincing, relevant, clear details/evidence. In argumentation, clearly distinguishes claims and counterclaims, refuting the latter.
• Focus/Coherence: Strong focus throughout essay.
• Language/Tone/Style: Language/vocabulary and tone consistently appropriate for audience and purpose; effective variation in sentence structure.
• Conventions: Few, if any, errors in grammar, usage, and mechanics, and none interfering with meaning.

