
Criterion-Referenced Tests

CRITERION-REFERENCED TESTS AND NORM-REFERENCED TESTS


RELIABILITY AND VALIDITY ASSESSMENT
SETTING PERFORMANCE STANDARDS
USES
A criterion-referenced test is a test that provides a basis for determining a candidate's level of
knowledge and skills in relation to a well-defined domain of content. Often one or more performance
standards are set on the test score scale to aid in test score interpretation. Criterion-referenced tests,
a type of test introduced by Glaser (1963) and Popham and Husek (1969), are also known as domain-referenced tests, competency tests, basic skills tests, mastery tests, performance tests or
assessments, authentic assessments, objective-referenced tests, standards-based tests,
credentialing exams, and more. What all of these tests have in common is that they attempt to
determine a candidate's level of performance in relation to a well-defined domain of content. This can
be contrasted with norm-referenced tests, which determine a candidate's level of the construct
measured by a test in relation to a well-defined reference group of candidates, referred to as the
norm group. So it might be said that criterion-referenced tests permit a candidate's score to be
interpreted in relation to a domain of content, and norm-referenced tests permit a candidate's score to
be interpreted in relation to a group of examinees. The first interpretation is content-centered, and the
second interpretation is examinee-centered.

CRITERION-REFERENCED TESTS AND NORM-REFERENCED TESTS


Because these two types of tests have fundamentally different purposes, it is not surprising that they
are constructed differently and evaluated differently. Criterion-referenced tests place a primary focus
on the content and what is being measured. Norm-referenced tests are also concerned with what is being measured, but to a lesser degree, since the domain of content is not the primary focus
for score interpretation. In norm-referenced test development, item selection, beyond the requirement
that items meet the content specifications, is driven by item statistics. Items are needed that are not
too difficult or too easy, and that are highly discriminating. These are the types of items that contribute
most to score spread, and enhance test score reliability and validity. With criterion-referenced test
development, extensive efforts go into ensuring content validity. Item statistics play less of a role in item selection, though highly discriminating items are still greatly valued, and sometimes item statistics are used to select items that maximize the discriminating power of a test at the performance standards of
interest on the test score scale.
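To make the role of these item statistics concrete, here is a minimal sketch in Python. The response data are invented for illustration and are not from the source; the sketch computes item difficulty as the proportion of candidates answering correctly and item discrimination as the point-biserial correlation between each item score and the total score.

```python
import numpy as np

# Hypothetical 0/1 scored responses: rows are candidates, columns are items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

total = responses.sum(axis=1)            # each candidate's total score
difficulty = responses.mean(axis=0)      # proportion correct per item (the "p-value")

# Point-biserial discrimination: correlation of each 0/1 item score with the total score.
discrimination = [
    float(np.corrcoef(responses[:, j], total)[0, 1])
    for j in range(responses.shape[1])
]

for j, (p, r) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"item {j}: difficulty = {p:.2f}, discrimination = {r:.2f}")
```

In norm-referenced item selection, items with moderate difficulty and high point-biserial values would typically be preferred, since these contribute most to score spread.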
Some scholars have argued that there is little difference between norm-referenced tests and criterion-referenced tests, but this is not true. A good norm-referenced test is one that will result in a wide
distribution of scores on the construct being measured by the test. Without score variability, reliable
and valid comparisons of candidates cannot be made. A good criterion-referenced test will permit
content-referenced interpretations and this means that the content domains to which scores are
referenced must be very clearly defined. Each type of test can serve the other main purpose (norm-referenced versus criterion-referenced interpretations), but this secondary use will never be optimal. For example, since criterion-referenced tests are not constructed to maximize score variability, their
use in comparing candidates may be far from optimal if the test scores that are produced from the test
administration are relatively similar (see Hambleton & Zenisky, 2003).

RELIABILITY AND VALIDITY ASSESSMENT


Because the purpose of a criterion-referenced test is quite different from that of a norm-referenced
test, it should not be surprising to find that the approaches used for reliability and validity assessment
are different too. With criterion-referenced tests, scores are often used to sort candidates into
performance categories. Consistency of scores over parallel administrations becomes less central
than consistency of classifications of candidates to performance categories over parallel
administrations. Variation in candidate scores is not so important if candidates are still assigned to the
same performance category. Therefore, it has been common to define reliability for a criterion-referenced test as the extent to which performance classifications are consistent over parallel-form
administrations. For example, it might be determined that 80% of the candidates are classified in the
same way by parallel forms of a criterion-referenced test administered with little or no instruction in
between test administrations. This is similar to parallel form reliability for a norm-referenced test
except the focus with criterion-referenced tests is on the decisions rather than the scores. Because
parallel form administrations of criterion-referenced tests are rarely practical, over the years methods
have been developed to obtain single administration estimates of decision consistency (see, for
example, Livingston & Lewis, 1995) that are analogous to the use of the corrected split-half reliability
estimates with norm-referenced tests.
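As a concrete illustration of decision consistency, the sketch below uses invented pass-fail classifications for ten candidates on two parallel forms (it is not the Livingston & Lewis single-administration method). It computes the proportion of candidates classified the same way on both forms, plus Cohen's kappa as a chance-corrected companion index.

```python
from collections import Counter

# Hypothetical classifications of the same ten candidates on two parallel forms.
form_a = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
form_b = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(form_a)
agreement = sum(a == b for a, b in zip(form_a, form_b)) / n   # here 8/10 = 0.80

# Chance agreement from the marginal category proportions, then Cohen's kappa.
count_a, count_b = Counter(form_a), Counter(form_b)
chance = sum((count_a[c] / n) * (count_b[c] / n) for c in set(form_a) | set(form_b))
kappa = (agreement - chance) / (1 - chance)

print(f"decision consistency: {agreement:.2f}")   # proportion classified the same way
print(f"kappa (chance-corrected): {kappa:.2f}")
```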
With criterion-referenced tests, the focus of validity investigations is on (1) the match between the
content of the test items and the knowledge or skills that they are intended to measure, and (2) the
match between the collection of test items and what they measure and the domain of content that the
tests are expected to measure. The alignment of the content of the test to the domain of content that
is to be assessed is called content validity evidence. This term is well known in testing practices.
Many criterion-referenced tests are constructed to assess higher-level thinking and writing skills, such
as problem solving and critical reasoning. Demonstrating that the tasks in a test are actually assessing
the intended higher-level skills is important, and this involves judgments and the collection of empirical
evidence. So construct validity evidence, too, becomes crucial in the process of evaluating a criterion-referenced test.

SETTING PERFORMANCE STANDARDS


Probably the most difficult and controversial part of criterion-referenced testing is setting the
performance standards, i.e., determining the points on the score scale for separating candidates into
performance categories such as passers and failers. The challenges are great because with
criterion-referenced tests in education, it is common on state and national assessments to separate
candidates into not just two performance categories, but more commonly, three, four, or even five
performance categories. With four performance categories, these categories are often called failing,
basic, proficient, and advanced.
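A minimal sketch of how such performance standards function at scoring time is given below, assuming hypothetical cut scores of 40, 60, and 80 on a 0-100 scale; the actual cut scores and scale are determined by the standard-setting process described in this section.

```python
from bisect import bisect_right

CUT_SCORES = [40, 60, 80]                                   # hypothetical performance standards
CATEGORIES = ["failing", "basic", "proficient", "advanced"]

def classify(score: float) -> str:
    """Map a score on the 0-100 scale to its performance category.

    A score exactly equal to a cut score is treated as meeting that standard
    (an assumption of this example).
    """
    return CATEGORIES[bisect_right(CUT_SCORES, score)]

for score in (35, 40, 72, 95):
    print(f"score {score} -> {classify(score)}")
```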
What makes the setting of performance standards on criterion-referenced tests controversial is that
the process itself is highly judgmental, and the implications are far-reaching. Candidates who fail the
test may be denied a high school diploma or a license to practice in the profession they trained for.
Teachers and administrators can lose their jobs if student test performance does not meet the
performance standards. Perceptions of the quality of education in a state can be affected by large
percentages of students being assigned to the failing or basic performance categories. With
international assessments such as the Trends in International Mathematics and Science Study (TIMSS), the educational
reputations of countries are based on criterion-referenced test performance.
The process of setting performance standards proceeds through many steps (see Cizek, 2001; Hambleton & Pitoniak, 2006). First, it is common to set a policy about the composition of the panel that
will set the performance standards. Here, decisions about the demographic make-up of the panel,
such as gender, ethnicity, years of experience, geographical distribution, role (e.g., teachers,
administrators, curriculum specialists, parents), are usually considered, as well as other factors. Then a
plan is put in place to draw a representative panel to meet the specifications.
Another big decision concerns the choice of standard-setting method. There are probably 10 to 20
major methods, and large numbers of variations of each. The methods include Angoff, Ebel, Nedelsky,
contrasting groups, borderline groups, direct consensus, item cluster, booklet selection, extended
Angoff, bookmark, and more.
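As an illustration of how one widely used method works, the sketch below implements the basic logic of the Angoff method with invented panelist ratings: each panelist estimates the probability that a borderline (minimally qualified) candidate would answer each item correctly, the estimates are summed per panelist, and the panel average serves as the recommended cut score. Operational Angoff studies add iteration, discussion, and impact data not shown here.

```python
# Hypothetical ratings: for each item, the judged probability that a borderline
# (minimally qualified) candidate answers correctly.
ratings = {
    "panelist_1": [0.60, 0.75, 0.40, 0.90, 0.55],
    "panelist_2": [0.70, 0.65, 0.50, 0.85, 0.60],
    "panelist_3": [0.55, 0.70, 0.45, 0.95, 0.50],
}

# A panelist's cut score is the sum of that panelist's item probabilities;
# the panel's recommended cut score is the mean of those sums.
panelist_cuts = {name: sum(probs) for name, probs in ratings.items()}
recommended_cut = sum(panelist_cuts.values()) / len(panelist_cuts)

for name, cut in panelist_cuts.items():
    print(f"{name}: cut score = {cut:.2f}")
print(f"recommended cut score: {recommended_cut:.2f} out of 5 items")
```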
Prior to the meeting of the panel to set the performance standards, it is common for a different panel to
prepare performance category descriptions. These descriptions lay out for the standard-setting panel
what it means to be a failing student, a basic student, and so on. The descriptions provide a basis for
the standard-setting panel to carry out its work of determining just how well candidates must perform
on the test to demonstrate basic, proficient, and advanced level performance. The descriptions are
also helpful in communicating what the expectations are for students in the performance categories,
and at the time of score reporting.
Next, the panel is brought together and the chosen method is applied to produce performance
standards. A typical panel meeting often begins with discussion of the purpose of the test and
exposure to the performance category descriptions. Having the panelists take a portion or even the
entire test is another activity that is included as part of the training. Then the method is introduced, and
practice is given prior to the panel starting on its task of setting the standards.
The meeting continues, and often two to three days are needed for the panelists to work through the
method and related discussions until a final recommended set of performance standards is produced.
Validity evidence is compiled about the process and the panelists' impressions of it, a technical
manual is often written, and then all of the information is forwarded to a board for setting the final
performance standards for the criterion-referenced test. If multiple tests are involved (e.g.,
mathematics, reading, and science tests at several grade levels), the task of making the complete set
of performance standards across subjects and grades consistent or coherent is especially
challenging.

USES
Criterion-referenced tests are used in many ways. Classroom teachers use them to monitor student
performance in their day-to-day activities. States find them useful for evaluating student performance
and generating educational accountability information at the classroom, school, district, and state levels.
The tests are based on the curricula, and the results provide a basis for determining how much is
being learned by students and how well the educational system is producing desired results.
Criterion-referenced tests are also used in training programs to assess learning. Typically, pretest-posttest designs with parallel forms of criterion-referenced tests are used. Finally, criterion-referenced tests are used in the credentialing field to determine persons qualified to receive a license or certificate. There are hundreds of credentialing agencies in the United States that use criterion-referenced tests to make pass-fail credentialing decisions.
See also: Classroom Assessment

BIBLIOGRAPHY

Cizek, G. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 519-521.
Hambleton, R. K., & Pitoniak, M. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (pp. 433-470). Westport, CT: American Council on Education.
Hambleton, R. K., & Zenisky, A. (2003). Issues and practices of performance assessment. In C. Reynolds & R. Kamphaus (Eds.), Handbook of psychological and educational assessment of children (pp. 377-404). New York: Guilford.
Livingston, S., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179-197.
Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1-9.
Copyright 2003-2009 The Gale Group, Inc. All rights reserved.
