Koretz, Daniel. Measuring Up

Measuring Up
What Educational Testing Really Tells Us
BY DANIEL KORETZ
ILLUSTRATED BY PAUL ZWOLAK
AMERICAN EDUCATOR | FALL 2008

Educational testing is ubiquitous in America, and its importance is hard to overstate. Tests have a powerful influence on public debate about many social concerns, such as economic competitiveness, immigration, and racial and ethnic inequalities. And achievement testing seems reassuringly straightforward and commonsensical: we give students tasks to perform, see how they do on them, and thereby judge how successful they or their schools are.

This apparent simplicity, however, is misleading.

Test scores do not provide a direct and complete measure of educational achievement. Rather, they are incomplete measures, proxies for the more comprehensive measures that we would ideally use, but that are almost always unavailable to us. There are two reasons for the incompleteness of achievement tests. The first, which has been stressed by careful developers of standardized tests for more than half a century, is that these tests can measure only a subset of the goals of education. Some goals, such as the motivation to learn, the inclination to apply school learning to real situations, the ability to work in groups, and some kinds of complex problem solving, are not very amenable to large-scale standardized testing. Others can be tested, but are not considered a high enough priority to invest the time and resources required. The second reason for the incompleteness of achievement tests, and the one that I will focus on here, is that even in assessing the goals that we decide to measure and that can be measured well, tests are generally very small samples of behavior that we use to make estimates of students’ mastery of very large domains of knowledge and skill.

The accuracy of these estimates depends on several factors, one of the most important being careful sampling of content and skills. For example, if we want to measure the mathematics proficiency of eighth graders, we need to specify what knowledge and skills we mean by “eighth-grade mathematics.” We might decide that this subsumes skills in arithmetic, measurement, plane geometry, basic algebra, and data analysis and statistics, but then we would have to decide which aspects of algebra and plane geometry matter and how much weight
should be given to each component (e.g., do students need to know the quadratic formula?). Eventually, we end up with a detailed map of what the test should include, often called “test specifications” or a “test blueprint,” and the developer writes test items that sample from it.

But that is just the beginning. The accuracy of a test score depends on a host of often arcane details about the wording of items, the wording of “distractors” (wrong answers to multiple-choice items), the difficulty of the items, the rubric (criteria and rules) used to score students’ work, and so on. The accuracy of a test score also depends on the attitudes of the test takers (for example, their motivation to perform well). It also depends, as we shall see later, on how schools prepare students for the test. If there are problems with any of these aspects of testing, the results will provide misleading estimates of students’ mastery of the larger domain.

A failure to grasp this fact is at the root of widespread misunderstandings, and misuses, of test scores. It has often led policymakers astray in their efforts to design productive testing and accountability systems. By placing too much emphasis on test scores, they have encouraged schools to focus instruction on the small sample actually tested rather than the broader set of skills the mastery of which the test is supposed to signal.

To make the principles of testing concrete, let’s construct a hypothetical test. Suppose that you publish a magazine and have decided to hire a few college students as interns to help out. You receive a large number of applicants and have decided that one basis for selecting from among them is the strength of their vocabularies. How do you determine that? Conversations with them will help, but may not be sufficient because they are not uniform: a conversation with one applicant may afford more opportunities for using advanced vocabulary than a conversation with a second one. So you decide to construct a standardized test of vocabulary.* You would then confront a serious difficulty: although many teachers and parents may find this fact remarkable in the light of their own experience, the typical adolescent has a huge working vocabulary. Clearly, you will have to select a sample of words to put into your test. In practice, you can get a reasonably good estimate of the relative strengths of applicants’ vocabularies by testing them on a small sample of words, if those words are chosen carefully. Assume you will use 40 words, which would not be an unusual number in an actual vocabulary test.

* People incorrectly use the term “standardized test” (often with opprobrium) to mean all sorts of things: multiple-choice tests, tests designed by commercial firms, and so on. In fact, it means only that the test is uniform: that is, that all examinees face the same tasks, administered in the same manner, and scored in the same way. The motivation for standardization is to avoid irrelevant factors that might distort comparisons among individuals.

The box below gives the first few words from three lists that you could use to select words for your test.

A            B         C
siliculose   bath      feckless
vilipend     travel    disparage
epimysium    carpet    minuscule

Which list would you use? Clearly not list A, which comprises specialized, very rarely used words. Everyone would receive a score of zero or nearly zero, and that would make the test useless: you would gain no useful information about the relative strengths of their vocabularies. List B is no better. Everyone would obtain a perfect or nearly perfect score. Therefore you would construct your test from list C, which comprises words that some applicants would know and others not.

In this example, the fact that a test is merely a sample of a larger domain is clear. But is sampling always as serious a problem as it is in this contrived example? For the most part, yes.† The tests that are of interest to policymakers, the press, and the public at large entail substantial sampling because they are designed to measure sizable domains, ranging from knowledge acquired over a year of study in a subject to cumulative mastery of material studied over several years.

† There are tests that are not samples of a larger domain. For example, a teacher may want to know whether her class has mastered the list of vocabulary words presented in the past week. She would not be trying to draw any conclusions about students’ overall vocabularies, and she would be happy indeed if most students got most of the words right.

Returning to the vocabulary test: what would have happened if you had chosen words differently, while keeping them at the same level of difficulty? To make this concrete, assume that you selected all three of the words shown in list C, and that I was also constructing a vocabulary test, but I dropped “feckless” and used “parsimonious” instead. For the sake of discussion, assume that these two words are equally difficult.

What would be the impact of administering my test rather than yours? Over a large enough number of applicants, the average score would not be affected at all, because the two words in question are equally difficult. However, the scores of some individual students would be affected. Even among students with comparable vocabularies, some would know “feckless” but not “parsimonious,” and vice versa.

This illustrates one source of measurement error, which refers to inconsistency in scores from one measurement to the next. To some degree, the ranking of your applicants will depend on which words you select from list C, and if you tested applicants repeatedly using different versions of your test, the rankings would vary a little. Another source of measurement error is the fluctuation over time that would occur even if the items were the same. Students have good and bad days. For example, a student might sleep well before one test date but be too anxious to sleep well another time. Or the examination room may be overheated one time but not the next. Yet another source of measurement error is inconsistencies in the scoring of students’ responses.

Obviously, it’s important to try to keep measurement error to a minimum, and that’s why test developers are so concerned with reliability. Reliable scores show little inconsistency from one measurement to the next; that is, they contain relatively little measurement error. Reliability is often incorrectly used to mean “accurate” or “valid,” but it properly refers only to the consistency of measurement. A measure, including a test, can be reliable but inaccurate, such as a scale that consistently reads too high.

So when all is said and done, how justified would you be in drawing conclusions about vocabulary from the small sample of words on your test? This is the question of validity, which is the single most important criterion for evaluating achievement testing. In public debate, and sometimes in statutes and regulations as well, we find reference to “valid tests,” but tests themselves are not valid or invalid. Rather, inferences based on test scores are valid or not. A given test might provide good support for one inference, but weak support for another. For example, a well-designed end-of-course exam in statistics might provide good support for inferences about students’
mastery of basic statistics, but very weak support for conclusions about mastery of mathematics more broadly. The question to ask is: how well supported is the conclusion?

None of the preceding is particularly controversial. These fundamentals of testing may not be well known outside the testing community, but inside that community they are widely agreed upon. The next and final step in this hypothetical exercise, however, is contentious indeed.

Suppose you are kind enough to share with me your test of 40 words. And suppose I intercept every single applicant en route to taking your test, and I give each one a short lesson on the meaning of every word on your test. What would happen to the validity of inferences you might want to base on your test scores? Clearly, your conclusions about which applicants have stronger vocabularies would now be wrong. Most students would get high scores, regardless of their actual vocabularies. Students who paid attention during my mini-lesson would outscore those who did not, even if their actual vocabularies were weaker. Mastery of the small sample of 40 words would no longer represent variations in the students’ actual working vocabularies.

This last step (teaching the specific content of the test, or material close enough to it to undermine the representativeness of the test) illustrates the contentious issue of score inflation, which refers to increases in scores that do not signal a commensurate increase in proficiency in the domain of interest. Inflation of scores in this case did not require any flaw in the test, and it did not require that the test focus on unimportant material. The 40 words were fine. My response to those 40 words, my form of test preparation, was not.

In real-world testing programs, issues of score inflation and test preparation are far more complex than this example suggests. So let’s set aside our vocabulary test and take a closer look at what I believe should be a very serious concern among educators and policymakers: how to prepare for tests.

Test preparation has been the focus of intense argument for many years, and all sorts of different terms (like “teaching the test” and “teaching to the test”) have been used to describe both good and bad forms. I think it’s best to ignore all of this and to distinguish instead between seven different types of test preparation: (1) working more effectively, (2) teaching more, (3) working harder, (4) reallocation, (5) alignment, (6) coaching students, and (7) cheating.

The first three are what some proponents of high-stakes testing want to see. Clearly, if educators find ways to work more effectively (for example, developing better curricula or teaching methods), students are likely to learn more. Up to a point, if teachers spend more time teaching, achievement is likely to rise. The same is true of working harder in school, although this can be carried too far. For example, it is not clear that depriving young children of recess, which some schools are now doing in an effort to raise scores, is effective, and in my opinion it is undesirable regardless. Similarly, if students’ workload becomes excessive, it may interfere with learning and may also generate an aversion to learning. But if not carried to excess, these three forms of test preparation can be expected to produce real gains in achievement that would appear not only in the test scores used for accountability, but on other tests and outside of school as well.

At the other extreme, cheating is unambiguously bad. But what about reallocation, alignment, and coaching? All three can produce real gains, score inflation, or both. Reallocation refers to shifting instructional resources (classroom time, homework, parental nagging, whatever) to better match the content of a specific test. A quarter century of studies confirm that many teachers reallocate instruction in response to tests. And some studies have found that school administrators reassign teachers to place the most effective ones in the grades in which important tests are given.¹

Is reallocation good or bad? Does it generate real gains in achievement or score inflation? This depends on what gets more emphasis, and what gets less. Some reallocation is desirable and is one of the goals of testing programs. For example, if a ninth-grade math test shows that students do relatively poorly in solving basic algebraic equations, one would want their teachers to put more emphasis on such equations. The rub is that devoting more resources to topic A entails fewer resources for topic B.

Scores become inflated when topic B (the material that gets less emphasis as a result of reallocation) is also an important part of the domain. If teachers respond to a test by de-emphasizing material that is important to the domain but is not given much weight on the particular test, scores will become inflated. Performance will be weaker when students take another test that places emphasis on those parts of the domain that have been neglected.

Alignment is a lynchpin of policy in this era of standards-based testing. Tests should be aligned with standards, and instruction should be aligned with both. And alignment is seen by many as insurance against score inflation, but this is incorrect. Alignment is just reallocation by another name. Whether alignment inflates scores also depends on the importance of the material that is de-emphasized. And research has shown that standards-based tests are not immune to this problem. These tests are still limited samples from larger domains, and therefore focusing too narrowly on the content of the specific test can inflate scores.

Coaching students refers to focusing instruction on small details of the test, many of which have no substantive meaning. Coaching need not inflate scores. If the format or content of a test is sufficiently unfamiliar, a modest amount of coaching may even increase the validity
of scores. For example, the first time young students are given a test that requires filling in bubbles on an answer sheet that is going to be scored by a machine, it is worth spending a very short time familiarizing them with this procedure before they start the test.

Most often, however, coaching students either wastes time or inflates scores. A good example is training students to use a process of elimination in answering multiple-choice questions. A Princeton Review test-prep manual urges students to do this because “it’s often easier to identify the wrong answers than to find the correct one.”² What’s wrong with this? The performance gains generated depend entirely on using multiple-choice items. Of course, when students need to apply their knowledge in the real world outside of school, the tasks are unlikely to appear in the form of a multiple-choice item.

This example shows that inflation from coaching is in one respect unlike inflation from reallocation. Reallocation inflates scores by making performance on the test unrepresentative of the larger domain, but it does not distort performance on the material tested. (If I taught applicants the vocabulary words on your test, they would know those words, but their scores on the test would not be good estimates of their overall vocabulary knowledge.) In contrast, coaching can exaggerate performance on the tested material. In the example just given, students who are taught to use the process of elimination as a method for “solving” certain types of equations will know less about those types of equations than their performance on the test indicates.

So what distinguishes good and bad test prep? The acid test is whether the gains in scores produced by test preparation truly represent meaningful gains in student achievement. We should not care very much about a score on a particular test. What we should be concerned about is the knowledge and skills that the test score is intended to represent. Gains that are specific to a particular test and that do not generalize to other measures of the domain and to performance in the real world are worthless.

* * *

This brings me to a final, and politically unpalatable, piece of advice: we need to be more realistic about using tests as a part of educational accountability systems. Systems that simply pressure teachers to raise scores on one test (or one set of tests in a few subjects) are not likely to work as advertised, particularly if the increases demanded are large and inexorable. They are likely instead to produce substantial inflation of scores and a variety of undesirable changes in instruction, such as excessive focus on old tests, inappropriate narrowing of instruction, and a reliance on test-taking tricks.

I strongly support the goal of improved accountability in public education. I saw the need for it when I was an elementary school and junior high teacher, many years ago. I saw it as the parent of two children in school. Nothing in more than a quarter century of education research has led me to change my mind on this point. And it seems clear that student achievement must be one of the most important things for which educators and school systems should be accountable. However, we need an effective system of accountability, one that maximizes real gains and minimizes bogus gains and other negative side effects. Even a very good achievement test will leave many aspects of school quality unmeasured. Some hard-core advocates of high-stakes testing disparage this argument as “anti-testing,” but it is a simple statement of fact, one that has been recognized within the testing profession for generations.

So how should you use scores to help you evaluate a school? Start by reminding yourself that scores describe some of what students can do, but they don’t describe all they can do, and they don’t explain why they can or cannot do it. Use scores as a starting point, and look for other evidence of school quality: ideally not just other aspects of student achievement but also the quality of instruction and other activities within the school. And go look for yourself. If students score well on math tests but appear bored to tears in math class, take their high scores with a grain of salt, because an aversion to mathematics will cost them later in life, even if their eighth-grade scores are good.

Sensible and productive uses of tests and test scores rest on a single principle: don’t treat “her score on the test” as a synonym for “what she has learned.” A test score is just one indicator of what a student has learned, an exceptionally useful one in many ways, but nonetheless one that is unavoidably incomplete and somewhat error-prone. ☐

Endnotes

1. For a good overview of some of the most important research on teachers’ and principals’ responses to testing, see Brian M. Stecher, “Consequences of Large-Scale, High-Stakes Testing on School and Classroom Practice,” in Making Sense of Test-Based Accountability in Education, ed. Laura S. Hamilton, Brian M. Stecher, and Stephen P. Klein (Santa Monica, CA: Rand, 2002), http://www.rand.org/pubs/monograph_reports/MR1554.
2. Jeff Rubenstein, Princeton Review: Cracking the MCAS Grade 10 Math (New York: Random House, 2000), 15.

This article, which originally appeared in the Fall 2008 issue of American Educator, was adapted from Daniel Koretz’s book, Measuring Up: What Educational Testing Really Tells Us. Detailed but nontechnical, the book addresses the common misunderstandings and misuses of standardized tests, and offers sound advice for using tests responsibly. To learn more, go to www.hup.harvard.edu/catalog/KORMAK.html. Measuring Up, copyright © 2008 by the President and Fellows of Harvard College, is available from all major booksellers.
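For readers who like to see the arithmetic, the vocabulary-test thought experiment (two 40-word tests that differ only in swapping “feckless” for an equally difficult “parsimonious”) can be sketched as a small simulation. This is only an illustration under stated assumptions, not a model taken from the article: it assumes each applicant knows any word at this difficulty level with a fixed personal probability drawn uniformly between 0.3 and 0.9, and the names `simulate_two_forms` and `summarize` are hypothetical.

```python
import random


def simulate_two_forms(n_applicants=2000, n_words=40, seed=0):
    """Score each applicant on two forms that differ in a single word.

    Assumption (not from the article): each applicant knows any word of
    this difficulty with a fixed personal probability p in [0.3, 0.9].
    """
    rng = random.Random(seed)
    abilities, your_scores, my_scores = [], [], []
    for _ in range(n_applicants):
        p = rng.uniform(0.3, 0.9)
        # The 39 words that appear on both forms.
        shared = sum(rng.random() < p for _ in range(n_words - 1))
        # The one word that differs; equally difficult by assumption.
        knows_feckless = rng.random() < p
        knows_parsimonious = rng.random() < p
        abilities.append(p)
        your_scores.append(shared + int(knows_feckless))
        my_scores.append(shared + int(knows_parsimonious))
    return abilities, your_scores, my_scores


def summarize(your_scores, my_scores):
    """Return both average scores and the share of applicants whose score differs."""
    n = len(your_scores)
    mean_yours = sum(your_scores) / n
    mean_mine = sum(my_scores) / n
    changed = sum(1 for y, m in zip(your_scores, my_scores) if y != m) / n
    return mean_yours, mean_mine, changed


if __name__ == "__main__":
    _, yours, mine = simulate_two_forms()
    mean_yours, mean_mine, changed = summarize(yours, mine)
    print(f"average score on your form: {mean_yours:.2f}")
    print(f"average score on my form:   {mean_mine:.2f}")
    print(f"share of applicants whose score differs: {changed:.1%}")
```

Under these assumptions, the two averages should come out nearly identical, while a sizable share of individual applicants score one point higher or lower depending on which form they happened to take. That gap between stable averages and shifting individual scores is the measurement error the article describes.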
