The more educators know about their students' status with respect to educationally relevant variables, the better will be the educational decisions that are made regarding those students. To illustrate, if a junior high school teacher knows that Lee Lacey is a weak reader and has truly negative attitudes toward reading, the teacher will probably decide not to send Lee trotting off to the school library to tackle an independent research project based on self-directed reading. Similarly, if a mathematics teacher discovers early in the school year that her students know much more about mathematics than she had previously suspected, the teacher is apt to decide that the class will tackle more advanced topics than originally planned.

Educators use the results of assessments to make decisions about students. But appropriate educational decisions depend on the accuracy of educational assessment because, quite obviously, accurate assessments will improve the quality of decisions whereas inaccurate assessments will do the opposite. And that's what validity is about.
It is the validity of a score-based inference that is at issue when measurement folks deal with validity. Tests, themselves, do not possess validity. Educational tests are used so that educators can make inferences about a student's status. High scores lead to one kind of inference; low scores typically lead to an opposite inference. Moreover, because validity hinges on the accuracy of our inferences about students' status with respect to an assessment domain, it is technically inaccurate to talk about the validity of a test. A well-constructed test, if used with the wrong group of examinees or if administered under unsuitable circumstances, can lead to a set of unsound and thoroughly invalid inferences. Test-based inferences may or may not be valid. It is the test-based inference with which educators ought to be concerned.
In real life, however, you'll find a fair number of educators talking about the validity of a test. Perhaps they really do know it's the validity of score-based inferences that is at issue, and they're simply using a short-cut descriptor. Based on my experience, it's more likely that they really don't know the focus of validity should be on test-based inferences, not on tests themselves.
Now that you know what's really at issue in the case of validity, if you ever hear a colleague talking about a test's validity, you'll have to decide whether you should preserve that colleague's dignity by letting such psychometric shortcomings go unchallenged. I suggest that, unless there are really critical decisions on the line (decisions that would have an important educational impact on students), you keep your insights regarding validation to yourself. (When I first truly comprehended what was going on with the validity of test-based inferences, I shared this knowledge rather aggressively with several fellow teachers and, thereby, meaningfully miffed them. No one, after all, likes a psychometric smart aleck.)
A junior high school English teacher, Cecilia Celina, has recently installed cooperative learning groups in all five of her classes. The groups are organized so that although there are individual grades earned by students based on each student's specific accomplishments, there is also a group-based grade that is dependent on the average performance of a student's group. Cecilia determines that 60% of a student's grade will be based on the student's individual effort and 40% of the student's grade will be based on the collective efforts of the student's group. This 60-40 split is used when she grades students' written examinations as well as when she grades a group's oral presentation to the rest of the class.
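To make the weighting concrete, Cecilia's 60-40 split amounts to a simple weighted average of a student's individual score and the group's average score. The function below is purely illustrative (the text doesn't specify how she computes it; the names and sample scores are assumptions):

```python
def blended_grade(individual_score, group_average, individual_weight=0.6):
    """Combine an individual score with the group's average score.

    A sketch of a 60-40 weighting like Cecilia's: 60% of the grade
    comes from the student's own work, 40% from the group's average.
    """
    group_weight = 1.0 - individual_weight
    return individual_weight * individual_score + group_weight * group_average

# A hypothetical student scoring 90 individually in a group averaging 80:
print(blended_grade(90, 80))  # 0.6 * 90 + 0.4 * 80 = 86.0
```

Note how the group component pulls an above-average student's grade down (and a below-average student's grade up), which is exactly the mechanism Fred worries about and Cecilia defends as motivating mutual help.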
Several of Cecilia's fellow teachers have been interested in her use of cooperative learning because they are considering the possibility of employing it in their own classes. One of those teachers, Fred Florie, is uncomfortable about Cecilia's 60-40 split of grading weights. Fred believes that Cecilia cannot arrive at an accurate grade for an individual student when 40% of the student's grade is based on the efforts of other students. Cecilia responds that this aggregated grading practice is one of the key features of cooperative learning because it is the contribution of the group grade that motivates the students in a group to help one another learn. In most of her groups, for example, she finds that students help other group members prepare for important examinations.
In thinking about the issue, because she respects Fred's professional judgment, Cecilia believes she has three decision options facing her. As she sees it, she can (1) leave matters as they are, (2) delete all group-based contributions to an individual student's grade, or (3) modify the 60-40 split.
In the last chapter, we saw that there are three kinds of reliability evidence that can be used to help us decide how consistently a test is measuring what it's measuring. Well, and this should come as no surprise, psychometricians also have isolated three kinds of evidence that can be used to help educators determine whether their score-based inferences are valid. Rarely will one set of evidence be so compelling that, all by itself, the evidence assures us that our score-based inference is truly accurate. More commonly, several different sets of validity evidence and several different kinds of validity evidence are needed for educators to be really comfortable about the test-based inferences they make.
When I was preparing to be a high school teacher, many years ago, my teacher education classmates and I were told that validity refers to the degree to which a test measures what it purports to measure. (I really grooved on the definition because it gave me an opportunity to use the word purport. Prior to that time, I didn't have too many occasions to do so.) Although, by modern standards, that traditional definition of validity is pretty antiquated, it contains a solid dose of truth. If a test truly measures what it sets out to measure, then it's likely the inferences that we make about students based on their test performances will be valid because we will interpret students' performances according to what the test's developers set out to measure.
Let's take a look, now, at the three kinds of evidence you may encounter in determining whether the inference one makes from an educational assessment procedure is valid. Having looked at the three varieties of validity evidence, I'll give you my opinion about what classroom teachers really need to know about validity and what kinds of validity evidence, if any, teachers need to gather regarding their own tests. Table 3-1 previews the three types of validity evidence we'll be considering in the remainder of the chapter.

Remembering that the more evidence of validity we have, the better we'll know how much confidence to place in our score-based inferences, let's look at the first variety of validity evidence, namely, content-related evidence of validity.
Content-related evidence of validity (often referred to simply as content validity) refers to the adequacy with which the content of a test represents the content of the assessment domain about which inferences are to be made. When the idea of content representativeness was first dealt with by educational measurement folks several decades ago, the focus was dominantly on achievement examinations such as a test of students' knowledge of history. If educators thought that eighth-grade students ought to know 124 specific facts about history, then the more of those 124 facts measured by a test, the more evidence there was of content validity.
These days, however, the notion of content refers to much more than factual knowledge. The content of assessment domains in which educators are interested can embrace knowledge (such as historical facts), skills (such as higher-order thinking competencies), or attitudes (such as students' dispositions toward the study of science). Content, therefore, should be conceived of broadly. When we determine the content representativeness of a test, the content in the assessment domain being sampled can consist of whatever is in that domain.
But what is adequate content representativeness and what isn't? Although this is clearly a situation in which more representativeness is better than less representativeness, let's illustrate varying degrees with which an assessment domain can be represented by a test. Take a look at Figure 3-2, where you see an illustrative assessment domain (represented by the shaded rectangle) and the items from different tests (represented by the dots). As the test items coincide less adequately with the assessment domain, the weaker is the content-related evidence of validity.
For example, in illustration A of Figure 3-2, we see that the test's items effectively sample the full range of assessment-domain content represented by the shaded rectangle. In illustration B, however, note that some of the test's items don't even coincide with the assessment domain's content, and that those items falling in the assessment domain don't cover it all that well. Even in illustration C, where all the test's items measure content included in the assessment domain, the breadth of coverage for the domain is insufficient.
Figure 3-2. Varying Degrees to Which a Test's Items Represent the Assessment Domain about Which Score-Based Inferences Are to Be Made
Trying to put a bit of reality into those rectangles and dots, think about an Algebra 1 teacher who is trying to measure his students' mastery of a semester's worth of content by creating a truly comprehensive final examination. Based chiefly on students' performances on the final examination, he will assign grades that will influence whether or not his students can advance to Algebra 2. Let's assume the content the teacher addressed instructionally in Algebra 1 (that is, the algebraic skills and knowledge taught during the Algebra 1 course) is truly prerequisite to Algebra 2. Then, if the assessment domain representing the Algebra 1 content is not satisfactorily represented by the teacher's final examination, the teacher's score-based inferences about students' end-of-course algebraic capabilities, and his resultant decisions hinging on students' readiness for Algebra 2, are apt to be in error. If teachers' educational decisions hinge on students' status regarding an assessment domain's content, then those decisions are likely to be flawed if inferences about students' mastery of the domain are based on a test that doesn't adequately represent the domain's content.
How do educators go about gathering content-related evidence of validity? Well, there are generally two approaches to follow in doing so. I'll describe each briefly.
DEVELOPMENTAL CARE
One way of trying to make sure that a test's content adequately taps the content in the assessment domain the test is representing is to employ a set of test-development procedures carefully focused on assuring that the assessment domain's content is properly reflected in the assessment procedure itself. The higher the stakes associated with the test's use, the more effort is typically devoted to making certain the assessment procedure's content properly represents the content in the assessment domain. For example, if a commercial test publisher were trying to develop an important new nationally standardized test measuring high school students' knowledge of chemistry, it is likely that there would be much effort during the test-development process to make sure that the appropriate sorts of skills and knowledge were being measured by the new national chemistry test.
Listed here, for example, are the kinds of activities that might be carried out during the
test-development process for an important chemistry test to assure that the content covered by
the new test properly represents the content that high school students ought to know about
chemistry.
Note that the illustrative question is a two-component question. Not only should the mathematical knowledge and/or skill involved, if unmastered, require after-school remedial instruction for the student, but the item being judged must also appropriately measure that knowledge and/or skill. In other words, the knowledge and/or skill must be sufficiently important to warrant remedial instruction on the part of the student, and the knowledge and/or skill must be properly measured by the item. If the content of an item is important and if the content of the item is also properly measured, the content review panelists should supply a yes judgment for the item. If either an item's content is not significant, or if the content is badly measured by the item, then content review panelists should supply a no judgment for the item.
By calculating the percentage of panelists who rate each item positively, an index of content-related evidence of validity can be obtained for each item. To illustrate the process, suppose we had a five-item test whose five items received the following positive per-item ratings from a content review panel: item one, 72%; item two, 92%; item three, 88%; item four, 98%; and item five, 100%. The average positive per-item content rating for the entire test, then, would be 90%. The higher the average per-item ratings provided by a test's content reviewers, the stronger the content-related evidence of validity.
In addition to the content review panelists' ratings of the individual items, the panel can also be asked to judge how well the test's items represent the domain of content that should be assessed in order to determine whether a student is assigned to the after-school remediation extravaganza. Presented here is an example of how such a question for content review panelists might be phrased.