
Running head: ASSESSMENT DEVELOPMENT FOR ESL COMPOSITION

Assessment Development for ESL Composition: An Achievement Assessment for Formative

Purposes

Yuanyuan Sun, Courtney Van Evera & Kiley Miller

Colorado State University



Introduction

Writing is among the most important skills that ESL students need to develop. Meanwhile,

the ability to teach writing is a significant part of the expertise of a well-trained language teacher

(Hyland, 2003). As current and future teachers of writing, we value and administer reliable and

valid assessments in order to successfully teach students and to measure students' writing

development. More specifically, in this paper, we focus on developing a take-home achievement

test for formative purposes to benefit teaching and learning in an ESL composition course.

According to Weigle (2012), students who learn to write in a second language context

generally need to write for school purposes, and they have an immediate need to master the

genres and conventions of writing in the target language. Assessments of writing vary widely

and may serve different functions. The four main purposes for language tests in academic

settings are proficiency, placement, diagnostic, and achievement (Weigle, 2012). Among them,

achievement tests are mostly classroom-based assessments which help to determine whether

students have mastered specific skills or knowledge they've learned (Weigle, 2012). Miller, Linn, and Gronlund (2009) describe the achievement test design as one which demonstrates student learning and success, and typical performance assessment as capturing students' typical behavior, referring to what they can do now, rather than what they will do in the future. This rationale and

design is most appropriate for our assessment, since the take-home test is designed to evaluate

students' performance, demonstrating their learning outcomes by a certain point of instruction in

the course and to check their knowledge of course content so far, rather than predict their

achievement at the end of the course.

Furthermore, Cumming (2002) pointed out a significant issue in academic-purpose

assessments of second-language writing, that is, to improve the formative value of assessment

for students' learning. Formative assessment is defined in many ways and is also known as "Assessment for Learning," a label which helps to clarify the purpose and reason for

administering such assessments (Burner, 2016). In writing, the focus tends to be on the process,

as opposed to the product, and there is strong emphasis on writing in multiple disciplines,

especially in American contexts (Reising, 1997). While Burner (2016) acknowledges the content

difficulties that accompany any writing-based curriculum, value remains in teaching and

formatively assessing writing. Central to formative assessment are providing feedback,

understanding the purpose and trajectory of assessment, and involving student learners in the

development process (Burner, 2016; Becker, 2016). Such approaches support constructivist

theories of learning where students are more active in the learning process (Edens & Shields,

2015). First-year ESL composition courses at universities with homogeneous-ability student groups, then, are prime candidates for such assessments: they use writing to integrate and demonstrate understanding of rhetorical content, and they employ the higher levels of Bloom's Taxonomy by having students synthesize sources, rhetorically critique texts, conduct peer review and provide constructive criticism, and perform self-reflections and evaluations. Writing serves as the medium for accomplishing this wide variety of tasks.

With respect to the form of assessment, researchers have begun to investigate the effectiveness of integrated writing tasks, which are increasingly used in L2 writing assessment, as an alternative to traditional selected-response items (Gebril & Plakans, 2014). Many researchers claim

that integrated writing tasks replicate the actual practices in academic contexts where discourse

synthesis is a common exercise in university writing, since they require learners to synthesize

information from external sources in their writing product (Gebril & Plakans, 2014). Therefore,

such integrated assessment methods can augment authenticity and better elicit the academic writing construct (Gebril & Plakans, 2014).

This paper provides a detailed overview of an achievement assessment for formative

purposes that was developed for an ESL composition class at Colorado State University (CSU).

To begin with, a description of the test is provided. The description includes the purpose of the

test, the type of the test, the interpretation of scores, TLU domain, the construct of the test, the

Table of Specifications, and the description of test tasks. Next, the test procedure section provides information about the participants, administration, and scoring procedures. The test results are presented in the following section, which is followed by a discussion offering an overall critique and evaluation of the test. The discussion also provides an overall estimation of the test's effectiveness as well as a reflection on the personal significance of the test development process.

Description of the Test

Purpose of Test

We developed the take-home assessment specifically for CO 150-I, the international section of the first-year composition course offered by CSU as an option, but not a requirement, for multilingual ESL students. The take-home assessment works especially well with CO 150-I, since the major writing assignments are scaffolded and build on each other. CO 150-I is divided into four units (A1, A2, A3, and A4) based on the syllabus, and each unit concludes with a writing assignment that assesses the content taught in that unit:

A2 prioritizes research and results in an annotated bibliography; A3 focuses on the people

(stakeholders) involved in the issues that students selected for research and results in a

stakeholder analysis; and A4 focuses on making an informed argument to one of these groups

using all the students' prior knowledge, resulting in a researched argument. These writing

assessments, while summative in that they encompass the main ideas of the unit, are also

formative in that they build on each other and provide the instructor with information regarding

the gaps in instruction and can inform areas of focus for the next assessment (Miller, Linn &

Gronlund, 2009). The take-home test developed by my colleagues and me focuses on bridging A3 and A4 and is informed by student achievement on the first take-home assessment, which bridged A2 to A3; our test follows a similar format and serves a similar purpose to increase validity and

reliability. While this type of assessment could serve to link any summative assessment to

another, this is particularly relevant to scaffolded summative assessments, and it serves to

reinforce the course objectives as well as prepare students for their next unit.

The CO 150-I class has two major assignments which are also used as assessments:

assignment A3 is a stakeholder analysis and assignment A4 is an argumentative research

assignment. Our formative assessment is a bridge between A3 and A4, with the directive to

identify a stakeholder to whom to pitch an argument. The purpose of this assessment centers on

the CO 150-I course objective to understand writing as a rhetorical practice, and therein, choose

effective strategies for addressing purpose, audience, and context.

This assessment has several traceable impacts on students and teachers. First, it is a low-stakes assessment, so student anxiety should be low. Students are able to self-assess by viewing the rubric prior to the test and reviewing their work afterwards. There should also be positive washback of rhetorical concepts for students: because the assessment focuses on concepts presented and utilized in class, it prompts students to review class content and ensures that they have mastered the major rhetorical concepts. Finally, the teacher can use the resulting data to adjust instruction to make it more useful, and she can use the assessment and its results to connect assignments A3 and A4, adding to the cohesion of the course.

Type of Test

The take-home assessment is designed to be an alternative assessment and an achievement

test based on the CO 150-I class syllabus. It is used to measure and evaluate students' knowledge and understanding of the class content and instruction of units A1, A2, and A3, which students need in order to accomplish the upcoming unit, A4, writing an argument essay, successfully. Furthermore, the assessment is also used formatively to pinpoint students' major errors and to identify students' potential problems and difficulties. Since the class units build on each other, identifying these problems should help both students and the instructor avoid possible failure in the assignments that follow the test.

Score Interpretation

The results of the take-home assessment are interpreted in a criterion-referenced manner. More specifically, the assessment focuses on limited and clearly defined tasks. Students' individual performance on specific tasks is described and evaluated according to rubrics based on the concrete class objectives and writing skills outlined above; students' performances are not compared against each other. The points factor into the Process Work category of assessments, making up 10% of the overall course score.
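To illustrate how this criterion-referenced score feeds into the course grade, the following Python sketch converts a raw test score into its share of the overall course score; the helper name and the simplifying assumption that this test is the only Process Work item are ours, not part of the course materials.

def process_work_contribution(test_score, test_total=20, weight=0.10):
    # Convert a raw test score into course percentage points, assuming
    # (hypothetically) that this test is the only Process Work item.
    return (test_score / test_total) * weight * 100

print(process_work_contribution(16.87))  # mean final score -> about 8.4 of 10 course points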

Specific Description of the TLU Domain

The target language use (TLU) domain for this assessment is the CO 150-I classroom.

Typical tasks therein encompass pieces of rhetorical writing that are often take-home

assignments, such as homework or drafting segments of larger projects. Major assessments are

scored by rubrics, and align with course objectives; most homework assignments are evaluated

based on rubrics with a variety of point scales. Our writing assessment had components of all the cognitive skill levels of Bloom's Taxonomy (1956). First, knowledge, comprehension, and analysis

were assessed, where students defined terms and provided limited responses to concept-centered

questions. For synthesis and evaluation, students responded to a writing prompt, where they

applied course concepts by choosing an appropriate or inappropriate stakeholder for a situation

and providing a rationale. This assessment requires the rhetorical use of audience, voice, and

context, which are part of the course objectives and practiced often in in-class and take-home

assignments. A table with the breakdown of TLU task characteristics for a typical writing task is

offered in Appendix A.

Construct Definition

The CO 150-I course focuses primarily on understanding rhetorical elements to compose a variety of texts effectively; writing is used to assess this rhetorical knowledge. The course objectives include "developing critical reading practices to support research and writing; understanding writing as a rhetorical practice, i.e., choosing effective strategies for addressing purpose, audience, and contexts; learning important elements of academic discourse...[to write] effective...arguments; [and] developing effective research and writing processes." Thus, many aspects of writing were assessed, though assumptions were made and some skills were excluded from the test.

Assessed skills. This assessment specifically evaluated writing skills, which were evidenced through a variety of aspects of communicative language ability.

Organizational knowledge. Students' grammatical knowledge and understanding of vocabulary items in the selected response questions were assessed; these vocabulary items related to rhetorical concepts (e.g., stakeholder, purpose) and not to the students' selected topics for research. Students' syntax was assessed through their writing, though this was not the primary focus, as the rubric demonstrates. Within the short and extended responses, students' textual knowledge was assessed through cohesion within sentences. Coherence was demonstrated in their ability to present a topic sentence, related evidence, and an explanation connecting the topic sentence and evidence to fully answer the prompt, as also outlined in the rubric.

Pragmatic knowledge. Functional knowledge was assessed through students' ability to engage in ideational functions, for example, defining and utilizing rhetorical terms and concepts, which are required tasks in the classroom, or TLU domain (see Appendix A). Knowledge of manipulative functions was assessed because rhetoric inherently affects the audience and stakeholders of the writing, and the world at large. Knowledge of heuristic functions was also assessed; in this case, problem solving is the function at hand.

Assumed knowledge. Students were expected to write in formal, academic English, so

sociolinguistic knowledge (for this particular take-home assessment) was assumed since the

students had been writing for the instructor all semester and should understand these

expectations. Additionally, reading skills and topical knowledge of their chosen areas of research were also assumed, since students conducted their own research independently of each other. While all the topics chosen by students related to higher education, such as tuition, this knowledge was assumed because students had been working with these topics for several weeks by that point in the semester. Students used this assumed knowledge to articulate their understanding of the

rhetorical concepts of argumentation. This assessment served as one aspect of the writing

process: helping students to evaluate appropriate stakeholders to begin organizing their research

and eventually craft an effective argument. Listening skill was assumed, as part of the lecture

was given orally with the support of written and textual aids.

Elements not evaluated. Students' speaking skills were not assessed, as the instructor only

evaluated their written output.

Table of Specifications

The Table of Specifications offers the design for the assessment to ensure a

representative sample of tasks is included in the process of developing an assessment (Jamieson,

2011). The Table of Specifications for this take-home assessment is a two-way chart that relates

the instructional objectives to tasks (see Appendix B). The table indicates the total points and the

percentage of points allotted to various instructional objectives, which are categorized more broadly according to the levels of Bloom's taxonomy, in relation to each task. The percentage

indicates the amount of emphasis on each area in the assessment as well.

The Table of Specifications first lists the levels of Bloom's taxonomy (1956), further divided by the objectives the assessment is intended to measure, across the top row, with the tasks down the first column. The objectives are rephrased from the course objectives according to their application in this assignment and are matched with three main categories based on Bloom's taxonomy of educational objectives: knowledge and understanding, application and analysis, and synthesis and evaluation. Each main category from the taxonomy is represented by two subcategories of instructional objectives based on the course; for example, knowledge and understanding includes rhetorical concepts and style and convention. The last

row at the bottom shows the percentage allocation of points for each objective. Generally

speaking, the points are spread relatively evenly among objectives: rhetorical concepts covers

20% of the assessment, style and convention covers 10%, stakeholder values covers 20%,

applied knowledge of audience (manipulative functions) covers 20%, organization (cohesion,

coherence) covers 10% and development of evidence and explanations covers 20%. The last

column on the right shows the percentage allocation of points for the tasks of the assessment.

Altogether, three tasks are designed to measure the objectives, and they are weighted quite

differently. Definition makes up 10% of the assessment, T-charts make up 30%, and the

extended response makes up 60% of the score.

There are several implications for the assessment based on the Table of Specifications.

To begin with, the table indicates that the tasks measure different levels of complexity of

learning outcomes. While the task of definitions focuses on assessing students' knowledge of

rhetorical concepts, the extended response item is developed to measure the majority of the

objectives listed in the chart. Moreover, the table suggests a time allocation for each item. Taking the distinct weights into consideration, students are implicitly encouraged to spend most of their time constructing answers for the last two tasks. The extended response item in particular requires the most time investment, as it accounts for the majority of the points and the largest percentage of the score.
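As a quick consistency check on these weights, the short Python sketch below mirrors the point allocations from the Table of Specifications (Appendix B) and recomputes each task's share; the dictionary layout is purely illustrative.

points = {
    "Definition": {"Rhetorical Concepts": 2},
    "T-Charts": {"Rhetorical Concepts": 2, "Stakeholder Values": 2,
                 "Development of Evidence and Explanations": 2},
    "Extended Response": {"Style and Convention": 2, "Stakeholder Values": 2,
                          "Applied Knowledge of Audience": 4, "Organization": 2,
                          "Development of Evidence and Explanations": 2},
}
total = sum(sum(objs.values()) for objs in points.values())  # 20 points
for task, objs in points.items():
    pts = sum(objs.values())
    print(f"{task}: {pts} points ({pts / total:.0%})")
# Definition: 2 points (10%); T-Charts: 6 points (30%); Extended Response: 12 points (60%)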

Description of Test Tasks

This take-home assessment (see Appendix C) consisted of three tasks comprising three different parts: a definition, a graphic organizer to be filled in (specifically, T-charts), and finally a writing prompt to which students produce an extended response. The first two tasks were limited production items, while the third was an extended production item. Tasks were specifically constructed to follow Bloom's Taxonomy, progressing in difficulty; such tasks necessarily engage students in lower levels of thinking in order to enact higher levels of thought processes (Jensen, McDaniel, Woodard, & Kummer, 2014). The first task, the definition,

assessed fundamental knowledge and matched the first and second tiers of the taxonomy focused

on knowledge and understanding; the second task assessed comprehension through identifying

influences and consequences via the graphic organizer; the final task reached the highest levels

of the taxonomy, as students defended and evaluated their thought processes and choices, which can be categorized as analysis, synthesis, and application (Usova, 1997, p. 103). The order of the tasks similarly reflected their difficulty level.

The directions for the whole test and for each individual task were provided in written English. The first task, defining a term from the course objectives, required students to recall knowledge, or at

least to review course materials to find the answer, as students are specifically referred to these

materials in the instructions. The second task required students to first report information they

developed from the previous summative assessment. Filling in the T-charts required students to compare the gains and losses of each stakeholder group by applying knowledge of their material and thinking critically about the hypothetical situation they had been

researching. In the extended response, students had to choose a stakeholder and develop two well-

reasoned paragraphs to defend their choices. All the tasks were non-reciprocal. The relationship

between input and response was fairly indirect, because the tasks built upon previous tasks and

topical knowledge.

When it comes to scoring, the first two tasks allowed partial credit based on the test key (see Appendix C). The extended response item was scored according to a specific rubric (see Appendix C) outlining questions that specify the criteria; these questions are provided to guide students' self-evaluation and serve as prompts, fitting the description of both holistic and analytic rubrics

(Becker, 2017). Mathematically, the general descriptors in the rubric equate to: "excellent" earning full points; "good" earning around 83%, or a B-equivalent score; "satisfactory" equaling around 66%, which sits just above the cut score of 60%; and finally "unsatisfactory" receiving 50% or lower, likely indicating a lack of attention or a completely missing feature.

The specific scales are provided in the rubric within each descriptor and are scored according to the evaluation of each criterion in the rubric; the criteria further elaborate the categories outlined in the Table of Specifications. For example, "Synthesis and Evaluation" is further defined as "Development of Evidence and Explanations," worth two points. These are expressed in the rubric section "Organization and Development" as the criteria "Do paragraphs have clear topic sentences, evidence, and explanations?" and "Are references used appropriately to help develop the paragraphs?" Table 1 outlines this sample scenario. If a student has used thoughtful topic sentences, supplied evidence, and very clearly explained the connection between these features, as indicated in the expected response, the student would likely earn "excellent." If the student uses multiple references in a single paragraph but only one in another paragraph, this may still be considered a "good" use of references, but not "excellent," since the student has not demonstrated synthesis in both paragraphs. Each category is assessed in this way, using the marginal commentary as a basis for completing the rubric and leading the evaluators to mark an X following each criterion under the appropriate descriptor. Scoring for these markings is outlined according to the scale, or averaged between two point values in the case of an even distribution of descriptors. The example above would appear for the student as indicated below in Table 1.

Table 1

Modified Rubric Sample

                                                              E     G      S     U    Score
                                                              2   1.66   1.33   <1
Organization and    Do paragraphs have clear topic            X                       1.83
Development (4)     sentences, evidence, and explanations?
                    Are references used appropriately to            X
                    help develop the paragraphs?

This even distribution of descriptors necessitates averaging the two point values, resulting in a score of 1.83 points for this organization category. Commentary precedes the completion of the rubric, giving the evaluator an indication of the strength of the response and enabling the evaluator to respond fairly and thoroughly.
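A minimal Python sketch of this averaging rule follows, using the descriptor-to-point mapping from Table 1 for a 2-point criterion; the mapping and function name are illustrative only.

DESCRIPTOR_POINTS = {"E": 2.0, "G": 1.66, "S": 1.33, "U": 1.0}

def category_score(descriptors):
    # Average the point values of the descriptors marked for a category.
    return sum(DESCRIPTOR_POINTS[d] for d in descriptors) / len(descriptors)

# The Table 1 example: one criterion marked "E", the other "G".
print(round(category_score(["E", "G"]), 2))  # 1.83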

Test Procedure

Participants

There were 19 student participants, all with at least mid-intermediate proficiency, since they had tested into first-year composition or passed prerequisite classes, such as an intensive English program (IEP) or additional intensive composition classes. Students were between the ages of 18 and 22. Students

came from a variety of countries, speaking a variety of languages, including: 1 student from

South Korea (L1 Korean), 1 Ethiopian student (L1 Amharic), 1 student from Kuwait (L1

Arabic), 2 Omani students (L1 Arabic), 1 Saudi Arabian student (L1 Arabic), and 12 Chinese

students (L1 Chinese). There were 7 male students and 12 female students. Seventeen students came to the U.S. specifically for college within the last two years; 2 of these completed intensive English programs before or concurrently with their CSU studies, while others completed at least one year of college in their home countries before transferring to the U.S. One student moved to the U.S. and completed high school in an American setting, and 1 student was born in the U.S., lived there until age 3, moved back to her home country, and then returned to the U.S. for college.

Administration

Hard copies of the assessment were distributed on a Wednesday at the end of class and collected the next class period, on Friday afternoon. The instructor informed students of the purpose and procedure of the test orally in class. Students were also asked to read the directions carefully. Students were allowed three days (72 hours) in total to complete the test. This timeframe allowed students time to produce, review, and revise, which is emphasized in the course objectives. Students submitted hard copies of the test at the beginning of class on Friday.

Scoring Procedures

We then scored the tests, with each of us scoring six, using the same answer key and rubric for the extended response item (see Appendix C). After the test had been piloted, the raters met together; the instructor of the CO 150-I class had prepared graded samples of the test representing high, medium, and low scores. We discussed the grading method, asked questions about interpretations of the rubric, and graded a couple more papers as a group. Then we graded the rest of the tests separately. The instructor collected all copies of the test with score reporting forms attached (see Appendix C) and returned them to students.

Test Results

Students were assigned three tasks for the take-home assessment, which came to a total of 20

points. Table 2 below outlines the task statistics for all three tasks, organized by student and

arranged according to task, with each student's individual final score. The numbers in parentheses next to the task descriptions indicate the total points possible for each task. Eighteen student scores were reported. Scores for the definition ranged from 0 to 2 (M = 1.28, SD = 0.65), the T-charts ranged from 3.33 to 6 (M = 5.57, SD = 0.71), and the extended response question ranged from

8.41 to 12 (M = 9.98, SD = 1.12). Final scores ranged from 14.07 to 19.51 (M = 16.87, SD =

1.45).

Table 2

Individual Task Score Report


Student Definition (2) T-Chart (6) Response (12) Final (20)
1 1 6 9.33 16.33
2 2 5.66 11.16 19.51

3 1.5 4.66 10.08 16.24


4 1.5 6 9.08 16.58
5 1.5 6 11 18.5
6 0.5 3.33 11.75 15.58
7 2 6 11.49 19.49
8 1.5 5 10.32 16.82
9 1 6 8.41 15.41
10 1.5 5.33 8.82 15.65
11 1.5 6 9.65 17.15
12 2 6 8.56 16.56
13 1.5 5.33 9.76 16.56
14 0 6 12 18
15 2 6 10.49 18.49
16 1.5 6 9.32 16.82
17 0 5 9.07 14.07
18 0.5 6 9.32 15.82
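These descriptive statistics can be reproduced with standard tools; the sketch below uses Python's statistics module (rather than SPSS, which the study used) to recompute the definition task's mean and sample standard deviation from the Table 2 column.

import statistics

definition = [1, 2, 1.5, 1.5, 1.5, 0.5, 2, 1.5, 1,
              1.5, 1.5, 2, 1.5, 0, 2, 1.5, 0, 0.5]  # Table 2, definition column
print(round(statistics.mean(definition), 2))   # 1.28
print(round(statistics.stdev(definition), 2))  # 0.65 (sample SD)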

Each assessment was reviewed by a single rater due to time constraints and students' need for feedback before the next assignment. Each rater scored six assessments, with Rater 1 (who is also the instructor of the course) completing the first six and distributing them as samples of a high score (around 19), a medium score (around 17-18), and a low score (16 or below). Scores are

presented in Table 3 below and divided by rater with mean and standard deviation provided for

each rater and task. Rater 1 consistently scored the highest on the extended response and final scores, while Rater 3 consistently scored the lowest on the same measures. Rater 2 averaged the highest score only for the definition and was otherwise the middle scorer in each category. However, nearly all raters' averages fell within one standard deviation of the other raters' scores, the only exception being the extended response scores of Rater 1 and Rater 3. Additionally, Rater 1's final scores do not fall within one standard deviation of Rater 3's scores.

Table 3

Rater-Based Individual Scoring


Student Definition T-charts Response Final
Rater 1 2 2 5.66 11.16 19.51
7 2 6 11.49 19.49
14 0 6 12 18
8 1.5 5 10.32 16.82
13 1.5 5.33 9.76 16.56
17 0 5 9.07 14.07
M 1.17 5.5 10.63 17.41
SD 0.93 0.46 1.11 2.06
Rater 2 5 1.5 6 11 18.5
15 2 6 10.49 18.49
4 1.5 6 9.08 16.58
3 1.5 4.66 10.08 16.24
10 1.5 5.33 8.82 15.65
6 0.5 3.33 11.75 15.58
M 1.42 5.22 10.2 16.84
SD 0.50 1.07 1.12 1.33
Rater 3 11 1.5 6 9.65 17.15
16 1.5 6 9.32 16.82
12 2 6 8.56 16.56
1 1 6 9.33 16.33
18 0.5 6 9.32 15.82
9 1 6 8.41 15.41
M 1.25 6 9.1 16.35
SD 0.52 0 0.49 0.64

Pearson correlations were analyzed across raters. While the individual tasks, specifically the definition and the response, showed little correlation, the overall final scores indicated very high correlation, with a Pearson correlation coefficient of 0.92. The T-chart task also showed relatively high correlation, at 0.75. Table 4 below outlines all coefficient scores.

Table 4

Pearson Correlation Coefficients by Task


Task Pearson Correlation Coefficient
Definition 0.26
T-Chart 0.75
Response -0.08
Final 0.92
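How score vectors were paired across raters is not fully specified above, so the sketch below only illustrates the computation itself: it correlates the final scores of Rater 1 and Rater 2 from Table 3, paired in their listed (rank-ordered) sequence. This pairing is an assumption for illustration, not a reconstruction of the reported coefficients.

from scipy.stats import pearsonr

rater_1_final = [19.51, 19.49, 18.0, 16.82, 16.56, 14.07]  # Table 3, Rater 1
rater_2_final = [18.5, 18.49, 16.58, 16.24, 15.65, 15.58]  # Table 3, Rater 2
r, p = pearsonr(rater_1_final, rater_2_final)
print(f"r = {r:.2f}, p = {p:.3f}")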

With so few students, no overlap between raters, and so few items, the standard error of measurement could not be calculated from the limited data. However, the extended response question elicited comparable data in several categories; Table 5 presents the statistics within each category of the rubric. The numbers alongside each scoring category indicate the total possible score for that area, which corresponds with the Table of Specifications. When compared, these scores demonstrated a Cronbach's alpha of 0.804, indicating high reliability for the extended response.

Table 5

Extended Response: Item Statistics

Scoring Category Mean Standard Deviation

Audience and Rhetoric (6) 4.98 0.58



Organization (4) 3.27 0.40

Style and Conventions (2) 1.73 0.29
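For readers who wish to replicate the reliability estimate, the sketch below implements Cronbach's alpha over a (students x categories) score matrix and then derives the standard error of measurement from the classical formula SEM = SD * sqrt(1 - reliability); the example matrix is hypothetical, not the study's raw category scores.

import numpy as np

def cronbach_alpha(scores):
    # scores: 2-D array, rows = students, columns = rubric categories.
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each category
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

example = [[5.0, 3.5, 2.0], [4.5, 3.0, 1.5], [5.5, 3.5, 2.0], [4.0, 2.5, 1.5]]
print(round(cronbach_alpha(example), 2))  # illustrative data only

sem = 1.12 * (1 - 0.804) ** 0.5  # SD of the response task (Table 2) and the reported alpha
print(round(sem, 2))             # about 0.5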

Overall, students performed well, as none scored below the cut score (60%, or 12 points),

which was determined based on the university grading policy. These results may be expected for

a formative assessment in which students were allowed to use their notes. This would indicate

that each student achieved mastery of the content.

Discussion

The results from the pilot assessment reveal many important implications, which can be

discussed in terms of various characteristics of the assessment procedures.

Critique of Task Performance

Ultimately, the limited number of participants and data reduces the generalizability of

these results, but these tasks still produced relevant and useful data that informed the formative

purposes of this assessment and will inform the redesigning of the assessment. The first task, the

definition, showed a very wide range of scores within one SD of the mean, indicating that

students scored across the entire possible range. The T-chart task demonstrated a greater range, but also had a larger possible score and was the overall highest-scoring task in terms of percentage (0.93), which may be because students drew this information not from the class lecture or notes but from their own previous summative assessment. This task also required only knowledge and understanding (Bloom, 1956), compared to the definition, which depended on students' notes or on finding the information in the course lectures or PowerPoints, and the extended response, which

required synthesis and evaluation skills, which are more difficult and sophisticated (Bloom,

1956). Still, the scores demonstrated a range that seemed appropriate and is largely consistent

with the range of overall scores for students in the course.

When scores were divided by rater, there were clear differences between the scores that each rater assigned; a formal check of interrater reliability would not only have yielded more data but would also have indicated which scores were inflated or graded too harshly. Due to time constraints and practicality, this step was not taken, which may affect

the range of data represented in this pilot study. For future implementations, this would be a

useful step for both student scores and for instructors to be able to ensure consistency.

Evaluation of Test Usefulness

Overall, the test was evaluated to be useful, primarily because of its adherence to assessment qualities, including reliability, validity, and impact. Together with the results and performance across each task, the assessment also successfully achieved its purpose.

Reliability. The reliability of the extended response portion of the assessment, in terms of Cronbach's alpha, was .804, which is reasonably high. Our largest goal in making our assessment reliable was achieving good interrater reliability. First of all, the three of us collectively designed the rubric for the assessment, which increased the likelihood of our grading being similar. Additionally, the scoring procedures mentioned above also supported interrater reliability. While a more analytical rubric with distinct categories for each rating within the extended response would likely have resulted in higher consistency among raters, constructing such a rubric for a formative assessment is not practical. While we weren't able to evaluate interrater reliability statistically by scoring the same assessments, we were able to compare overall scores across raters, obtaining a Pearson correlation coefficient of 0.92,

despite varying correlation coefficients for the individual tasks. More communication between

raters, along with interrater reliability checks if raters are able to score the same assessments for comparison, may be the most practical path to higher reliability. Additionally, if other similar bridge assignments connecting two summative assessments were added as additional tasks to score, or if more students were available for the pilot, a better understanding of reliability could be achieved.

Validity. Constructs for this assessment are derived from the following course objectives from the source syllabus: "developing critical reading practices to support research and writing; understanding writing as a rhetorical practice, i.e., choosing effective strategies for addressing purpose, audience, and contexts; learning important elements of academic discourse...[to write] effective...arguments; [and] developing effective research and writing processes." Constructs

assessed were writing skills, particularly in the realms of organizational and pragmatic

knowledge. Since the assessment modeled past and future assignments, the measure of students'

abilities within these constructs was consistent between performances on similar assignments,

which could support and indicate validity. The assessment had a high level of interactiveness

because the construct involved organizational and pragmatic linguistic knowledge and

knowledge of a certain topic of research and stakeholders. The TLU domain for this assessment

was the course itself, and future assignments therein. The assessment mirrored past assignments

which had taken place, and was intended to mirror future assignments and writing assessments,

particularly the A4 Academic Argument, which was the largest summative assessment of the

course.

The instructions of the test tasks resembled instructions of other assignments and utilized

prior knowledge and skills gained in class. No problems were detected. Regarding input, one

area for review may have been the vocabulary utilized for tasks one and two. For task one, since

a definition was required, some students simply Googled the definition, rather than relying on the

class-specific definition provided, so several incomplete responses were given. This potential

confusion was supported in that the definition had the lowest average percentage score (0.64) of

any task. Similar definition tasks have been used before, increasing the validity of the

assessment, and students were referred to the notes in the instructions, so further investigation

may be needed to determine the reason for difficulty with this task. On the T-chart task, a small number of students showed ongoing confusion about what stakeholders' gains and losses were. For formative purposes, the assessment did what it needed to do by revealing this problem of understanding vocabulary. Considering the summative purposes of the larger A3 and A4 assessments that this bridge assessment links, however, the small number of students who struggled to differentiate between gains and losses may not have performed optimally because of this confusion. The relationship between input and response proved to be very smooth, as predicted, because of the amount of in-class conditioning and because the class itself was the TLU domain.

One feature of the assessment that was potentially damaging to its validity

was the generous scoring system. The first two tasks regarded class knowledge and skills which

were predicted to be reasonable for the students to recall or find in the class notes. The extended

response portion was the largest indicator of success on the assessment overall, which may be in

part because it was also the most heavily weighted part of the assessment. The rubric allotted six

points for "Audience and Rhetorical Knowledge," four points for "Organization and Development," and two points for "Style and Convention." Further analysis of these categories

may reveal patterns in scoring and performance that would be beneficial to students and

developing appropriate instruction accordingly. With the lowest student score on this portion at

8.41 out of 12 possible points, this task would be worth investigating. With all students receiving

between 6 and 8 points on the first two sections, the lowest total score was a 14.07 out of 20,

which is 70%, a C-. The cut score was 12 out of 20, so all students proved proficient. It's possible

that these scores were skewed to the high end because of the generous rubric. The time given for

tasks mirrored past and future tasks, and there were no detected difficulties regarding length of

time allotted. All in all, because we had an actual CO 150-I class at our disposal, our

assessment was highly authentic.

Impact. Assessments were returned to students at the end of class approximately 10 days

after they submitted the assignment, which is a longer timeframe than most electronically

submitted assignments with a single rater. This may have been frustrating to students who

place high priority on their schoolwork and are accustomed to receiving grades for similar

formative assessments within the week. However, no negative impacts were directly relayed to

the instructor. Since the assignments included considerable feedback and a clear rubric, students

were able to see exactly what points were missed and areas for improvement, which should have

had a positive effect on this highly motivated group of students. Students had proven their ability to receive and incorporate feedback and commentary on previous assignments, so it's likely that this

feedback and the relatively detailed rubric provided areas for students to improve and continue to

learn.

In terms of instruction, this formative assessment provided valuable positive impacts on

teaching, and the instructor was able to incorporate more instruction regarding gains and losses as

they relate to stakeholders and argumentative, persuasive writing techniques. The instructor was

able to remind students to rely on the notes provided in class and reviewed where to find the

information, since students performed most poorly on the item that required their notes or

resourcefulness to find the class PowerPoint lectures. Since 77.77% of students did not score full points on the definition portion, the instructor also reviewed synthesis as a rhetorical concept and

how to apply and demonstrate synthesis in writing and using resources. The style section of the

writing rubric was also an area for improvement for many of the students, so the instructor reviewed MLA format and basic assignment requirements, such as typing when required, as this was also a factor that caused students to lose points. This deficiency also related to the instructions and may indicate that students were simply not reading the instructions carefully, which may

be an additional aspect to review. With the exception of the possible negative affective impact

regarding the length of time to return the assessment, the impact is believed to be largely

positive. Students were provided with additional learning opportunities through the use of a

rubric and the commentary received, and the instructor adapted instruction specifically to address

the shortcomings revealed by this assessment.

Test Purpose

Our assessment was designed to be an achievement assessment used for formative

purposes. The summative purpose of the assessment was to measure linguistic skills and

knowledge from the A3 stakeholder analysis assignment before moving on to the A4 research

paper assignment. The formative purpose of the assessment was to reveal what areas need more

instruction between the A3 and A4 assignments. This assessment revealed that all students were

above the cut score, which was 12 out of 20 possible points, with the lowest score being 14.07

out of 20 possible points. This means that all students showed they had gained the linguistic

skills and knowledge from the A3 assignment through performance on this assessment. We were

also able to pinpoint some areas that needed further instruction. Our estimation is that our

assessment achieved its formative purpose.



Reflection

My colleagues and I developed this assessment as required by the class E 638 Assessment

of English Language Learners that we took for a TEFL/TESL graduate program in 2017. I have

learned a lot, in many respects, from the process of assessment development through the steps of test proposal, development, implementation, and results analysis. First of all, at the beginning of the semester I barely knew what the important concepts in assessment were or how to create an assessment. It has been very helpful for me to go through all the steps needed to develop a valid

and reliable test, especially since this type of curriculum-based achievement test is very

commonly seen in the EFL language classrooms in China that I expect to work in. Secondly, I

benefited a lot from the test development process by putting theory into action: I learned to use

TLU task tables to make sure my assessment is authentic; I learned to match the test constructs

to course goals and objectives; I learned to use the Table of Specifications to ensure tasks

operationalize the constructs well. On the other hand, it was very challenging and time consuming to consider all of these elements in the test at the same time so that each was valid in its own right and all worked together effectively. For example, when we tried to create the rubric for the extended response item, we spent a lot of time together making sure that all the criteria in the rubric, which show what the task intends to measure, matched the test constructs. We also had to make sure the points assigned to the criteria matched the points allocated in the Table of Specifications.

I also learned many things from the process of piloting the test and scoring. I never

thought much about test instructions before, and now I realize instructions are extremely important for supporting students' performance and guiding them to study effectively and efficiently. However, even though I thought we had created good written instructions for the test, it still surprised me when I graded students' tests and realized how many of them tended to overlook

them. It encouraged me to think about the ways to make students realize the significance of

understanding instructions. As for scoring, since I had little prior teaching and grading experience compared to my group members, I learned a lot from them. For example, I practiced how to give students corrective and constructive feedback without discouraging them. In addition, I learned that while a quantitative description of the test can help to measure students' achievement, qualitative evaluation, such as feedback and comments in the margins, is crucial to achieving the formative purposes of a test and making a test a true assessment for learning.

Last but not least, it was very helpful for me to learn not only how to analyze data using

technology tools such as SPSS (Statistical Package for the Social Sciences), but more

significantly, how to interpret the results to make them truly useful and meaningful for gauging the effectiveness of a test and informing instructional decisions and learning.



References

Becker, A. (2016). Student-generated scoring rubrics: Examining their formative value for

improving ESL students' writing performance. Assessing Writing, 29, 15-24.

Becker, T. (2017, February 15). E 638 Assessment of English Language Learners [Class

handout]. Fort Collins, CO: Author.

Bloom, B. S. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain.

New York, NY: David McKay Co, Inc.

Burner, T. (2016). Formative assessment of writing in English as a foreign language.

Scandinavian Journal of Educational Research, 60(6), 626-648.

Cumming, A. (2002). Assessing L2 writing: Alternative constructs and ethical dilemmas.

Assessing Writing, 8(2), 73-83.

Edens, K., & Shields, C. (2015). A Vygotskian approach to promote and formatively assess

academic concept learning. Assessment & Evaluation in Higher Education, 40(7), 928-

942.

Gebril, A., & Plakans, L. (2014). Assembling validity evidence for assessing academic writing:

Rater reactions to integrated tasks. Assessing Writing, 21, 56-73.

Hyland, K. (2003). Second language writing. New York, NY: Cambridge University Press.

Miller, M. D., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching. Upper Saddle River, NJ: Pearson.

Jamieson, J. (2011). Handbook of second language teaching and research (E. Hinkel, Ed.). New York, NY: Routledge.

Reising, B. (1997). The formative assessment of writing. The Clearing House: A Journal of Educational Strategies, 71(2), 71-72.



Jensen, J., McDaniel, M., Woodard, S., & Kummer, T. (2014). Teaching to the test...or testing to

teach: Exams requiring higher order thinking skills encourage greater conceptual

understanding. Educational Psychology Review, 26(2), 307-329. doi:10.1007/s10648-013-9248-9

Usova, G. M. (1997). Effective test item discrimination using Bloom's taxonomy. Education,

118(1), 100-100.

Weigle, S. C. (2012). Assessing writing. In C. Coombe, B. O'Sullivan, P. Davidson, & S. Stoynoff (Eds.), The Cambridge guide to second language assessment. Cambridge: Cambridge

University Press.

Appendix A

TLU Task Characteristics

Characteristics of the setting

physical characteristics Take home - student's choice of environment

participants CO 150.404

time of task Friday after class to Monday class-time (72 hours)

Characteristics of the test rubric

instructions

language English (target language)

channel Written, visual with brief oral introduction

specification of procedures and tasks Selected response, written, brief and extended
response

structure Typical writing tasks

time allotment Days depending on the length of writing

scoring method

criteria for correctness Selected response: 0 = wrong, 1 = right


Constructed response: rubric

procedures for scoring the response 0-3 scale

explicitness of criteria and procedures Given, explicit rubric provided

Characteristics of the input

format

channel Written, oral lecture, visual



form Language

language English

length 50-minute class (MWF, during week)

type Verbal and non-verbal lecture, written, visual

degree of speededness Moderate, varied based on lesson and medium (lecture content could serve as input)

vehicle Reproduced

language of the input English

language characteristics Academic

organizational characteristics Verbal with written and visual support depending on lesson needs; written homework and reading to supplement

grammatical Written instructions

textual Verbal and written instructions, Socratic questioning, textbook readings, group verbal exchanges, note-taking

pragmatic characteristics

functional Heuristic, ideational, manipulative

sociolinguistic Formal, colloquial, natural, polite, academic

topical characteristics Controversial issues in higher education such as tuition, rhetoric and college composition

Characteristics of the expected response

format

channel Written

form Language, selected response

language English, target

length 72 hours (1-1.5 hours suggested)



type Selected response, short and extended response

degree of speededness Moderate

language of the expected response English

language characteristics Academic, formal, natural, polite, written

organizational characteristics Topic sentence and supported evidence

grammatical College-level vocabulary, standard English, some specialized vocabulary

textual Paragraph organization, complete sentences

pragmatic characteristics

functional Ideational, heuristic, manipulative

sociolinguistic Formal, natural, polite, academic

topical characteristics Controversial issues in higher education, rhetoric and college composition

Relationship between input and response

reactivity Non-reciprocal, written, prompted, synthesis of information

scope of relationship Moderate

directness of the relationship Direct


Appendix B

Table of Specifications

                     Knowledge and Understanding   Application and Analysis      Synthesis and Evaluation
Tasks                Rhetorical   Style and    Stakeholder   Applied       Organization   Development of   # Points   % Points
                     Concepts     Convention   Values        Knowledge of  (cohesion,     Evidence and
                                                             Audience*     coherence)     Explanations

Definition               2            -             -             -             -               -              2         10
Research Answer          2            -             2             -             -               2              6         30
and T-Charts
Extended Response        -            2             2             4             2               2             12         60

# Points                 4            2             4             4             2               4             20
% Points                20           10            20            20            10              20                       100

*manipulative functions

Appendix C

A3-A4 Bridge Take-Home Assessment

A3-A4 Bridge Take-Home Assessment (20 points)

INSTRUCTIONS: For the majority of the semester, you have been working closely with an issue in

higher education, reading background information in the A2 Annotated Bibliography then analyzing

relevant stakeholders who are affected by the issue in the A3 Stakeholder Analysis. This bridge

assessment is intended to help you identify important pieces from A3 and begin to assemble the A4

Academic Argument. These items are meant to test your knowledge and understanding of the previous

material, as well as help you further analyze your stakeholder options to help you evaluate and make your

choice of stakeholder for A4.

You are encouraged to use your class notes, my PowerPoint lectures, and your previous assignments to

complete this assessment. You have until class-time on Friday, April 7 to complete and submit a hard

copy of this assignment. You should write your responses for the first two items in the booklet provided.

1. Provide a definition of synthesis. Your response should not be a direct copy of course materials. (2

points)

2. List the answer to your research question from A3. In the T-charts provided for you below, name the

three stakeholders you analyzed from A3 and compare their stakes in the issue. On the left side of the

T-chart, list at least 2 positive aspects (gains), and on the right side list at least 2 negative aspects

(losses) for the stakeholder. Fill out the chart with 5 reasons total for each stakeholder. (6 points)

Answer to the Research Question (1 point):

_____________________________________________________________________________________

_________________________________________________________________

T-Charts (stakeholder: .33 point each; reasons: 1.33 points per T-chart):

Example:

Stakeholder: Aliens who come to Earth               Stakeholder:

GAINS                   LOSSES                      GAINS               LOSSES

- Find new allies       - Potential war
- Explore new land      - Risk lives if humans
- Become famous           are violent

Stakeholder: Stakeholder:

GAINS LOSSES GAINS LOSSES



3. Based on the T-charts, choose the most appropriate stakeholder and the least appropriate
stakeholder for your issue. Compose one paragraph for each stakeholder (2 paragraphs,
which should be around one page total) explaining your choices. You may consider the
power the stakeholders have over the issue, how the stakeholder will be influenced by the
issue, how resistant they may be to your potential arguments, and/or the evidence you already
have that the stakeholder would find convincing. Refer to class notes, PowerPoints, and
handouts as necessary using MLA format for any in-text citations. A Works Cited page is not
needed. Type this response and submit with the previous two steps of this assignment. (12
points)



Key for A3-A4 Bridge Take-Home Assessment

1. Provide a definition of synthesis. (2 points)

Partial credit possible


Supporting a similar claim using multiple sources
Demonstrating a connection between the information provided

2. List your research question and the stakeholders you analyzed from A3. Construct a T-chart (3

total) for each stakeholder, comparing their stakes in the issue. On the left side of the T-chart, list positive

aspects (gains), and on the right side list negative aspects (losses) for the stakeholder. Fill out the chart

with 5 reasons total for each stakeholder. (6 points)

Answer to the Research Question (1 point):

Answers will vary, based on A3 topics and issues. Partial Credit Possible for:
Full sentence required
Clarity and appropriateness of answer

T-Charts (stakeholder: .33 point x 3 = 1, reasons: 1.33 points/T-chart x 3 = 4 points):

Partial credit possible


Stakeholder: .33 point
Reasons (5 total): .27 points each, distribution between gains/losses does not matter
though the table distribution will likely lead to a maximum of 4 gains or 4 losses, forcing
at least one in each column

3. Based on your T-charts, choose the most appropriate stakeholder and the least appropriate stakeholder

for your issue. Compose one paragraph for each stakeholder (around one page total) explaining your

choices. You may consider the power the stakeholders have over the issue, how resistant they may be to

your potential arguments, and/or the evidence you already have that the stakeholder would find

convincing. Refer to class notes, PowerPoints, and handouts as necessary using MLA format for any in-

text citations. A Works Cited page is not needed. Type and print this response. (12 points)

Partial credit possible according to rubric.



Rubric for task 3, Extended Response

Question 3 Grading Criteria (12 points)

Your instructor will ask the following questions when evaluating your work.
Scale: Excellent / Good / Satisfactory / Unsatisfactory.

Audience and Rhetorical Knowledge (6)
  Are stakeholder values stated clearly?                            3 / 2.5 / 2 / <1.5
  Are the values for both stakeholders described logically?
  Are the justifications for both stakeholders relevant to
  the issue?

  Are the stakeholders appropriate in terms of power and            3 / 2.5 / 2 / <1.5
  capability to affect change regarding the issue?
  Are the explanations compelling and related to the issue?
  Are rhetoric and audience appeals applied appropriately to
  justify the choice of stakeholders?

Organization and Development (4)
  Are ideas logically organized within the paragraph?               2 / 1.66 / 1.33 / <1
  Is appropriate transitional language used within sentences?

  Do paragraphs have clear topic sentences, evidence, and           2 / 1.66 / 1.33 / <1
  explanations?
  Are references used appropriately to help develop the
  paragraphs?

Style and Convention (2)
  How well does the essay follow MLA conventions?                   1 / .83 / .66 / <0.5

  How well has the writer proofread and edited for grammar,         1 / .83 / .66 / <0.5
  sentence-structure, and punctuation errors to make the essay
  clear and easily readable?

*See the back of this page and the margins for additional commentary.

Final Score: ____/ 12

Score Reporting Form

1. Definition ___/2 points

2. T-charts

Answer to research question: __/1 point

Stakeholders: __/1 point

T-chart reasons: __/4 points

3. Extended Response ___/12 points

Question 3 Grading Criteria (12 points)

Your instructor will ask the following questions when evaluating your work.
Scale: Excellent / Good / Satisfactory / Unsatisfactory.

Audience and Rhetorical Knowledge (6)
  Are stakeholder values stated clearly?                            3 / 2.5 / 2 / <1.5
  Are the values for both stakeholders described logically?
  Are the justifications for both stakeholders relevant to
  the issue?

  Are the stakeholders appropriate in terms of power and            3 / 2.5 / 2 / <1.5
  capability to affect change regarding the issue?
  Are the explanations compelling and related to the issue?
  Are rhetoric and audience appeals applied appropriately to
  justify the choice of stakeholders?

Organization and Development (4)
  Are ideas logically organized within the paragraph?               2 / 1.66 / 1.33 / <1
  Is appropriate transitional language used within sentences?

  Do paragraphs have clear topic sentences, evidence, and           2 / 1.66 / 1.33 / <1
  explanations?
  Are references used appropriately to help develop the
  paragraphs?

Style and Convention (2)
  How well does the essay follow MLA conventions?                   1 / .83 / .66 / <0.5

  How well has the writer proofread and edited for grammar,         1 / .83 / .66 / <0.5
  sentence-structure, and punctuation errors to make the essay
  clear and easily readable?

*See the back of this page and the margins for additional commentary.

4. Final Score: ____/ 20


(Cut Score = 60%, 12/20)
