INTRODUCTION
National trends have largely been set for the 1990s. They indicate continued
emphasis on testing, at least as far as standardized tests are concerned, with expanded
uses of test results.
The local school continues to be the implementing agency for assessment programs,
typically those mandated at the state level but also influenced by national trends.
Measurements may differ in the amount of information the numbers contain. These
differences are distinguished by the terms nominal, ordinal, interval, and ratio scales of
measurement.
1. Nominal scales categorize only.
2. Ordinal scales categorize and order.
3. Interval scales categorize, order, and establish an equal unit in the scale.
4. Ratio scales categorize, order, establish an equal unit, and contain a true zero
point.
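As a concrete illustration of the four scales, here is a brief Python sketch; all of the data values are hypothetical.

```python
# Illustrative (hypothetical) data for the four scales of measurement.

# Nominal: numbers only categorize (1 = girl, 0 = boy); no order implied.
gender_codes = [0, 1, 1, 0, 1]

# Ordinal: numbers categorize and order (class ranks), but gaps are not equal units.
class_ranks = [1, 2, 3, 4, 5]

# Interval: equal units but no true zero point.
test_scores = [65, 70, 85, 90]
# Differences are meaningful: 70 - 65 equals 90 - 85 ...
assert test_scores[1] - test_scores[0] == test_scores[3] - test_scores[2]
# ... but ratios are not: a score of 80 is not "twice" a score of 40.

# Ratio: equal units and a true zero, so ratios are meaningful.
times_in_seconds = [30.0, 60.0]
assert times_in_seconds[1] / times_in_seconds[0] == 2.0  # twice as long
```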
Review Items
a. Predominately norm-referenced.
b. Predominately criterion-referenced.
2. When state legislators have mandated testing, such legislation has been focused
primarily on:
5. Of the terms below, the one having the narrowest meaning is:
a. Evaluation.
b. Assessment.
c. Measurement.
d. Test.
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
8. If, when reading information about a class, we assign a 1 to girls and a 0 to boys,
this level of measurement is:
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
9. The scores from a typical classroom test are probably better than
____________ measurement, but not quite ____________.
a. Ordinal, nominal.
b. Interval, ratio.
c. Nominal, ordinal.
d. Ordinal, interval.
10. When using a performance test, the difference between scores of 65 and 70 equals
the difference between scores of 85 and 90, but a score of 80 is not twice a score of
40. The level of measurement is:
a. Interval
b. Nominal.
c. Ordinal.
d. Ratio.
PLANNING THE TEST
Tests are given for many different reasons. In order to achieve such diverse purposes,
they need to be carefully planned. In classroom settings, this planning usually entails
instructional objectives and/or a table of specifications.
When teachers use specific instructional objectives, it becomes very clear what should
be on the test. The desired student behaviors from the objectives translate directly into
the items for the test.
The use of objectives or a table of specifications may imply strict guidelines for
constructing tests, but much of testing is determined by practical considerations. Test
constructors must consider such things as how much testing time is available, what item
formats the students can handle, and the developmental level of the examinees. A well-
planned test is no accident.
Review Items
1. The Bloom et al. taxonomy of educational objectives in the cognitive domain is a
hierarchy based on the:
a. Extent of recall required to attain an objective.
b. Level of understanding required to attain an objective.
c. Reading level required to attain an objective.
d. Aptitude required to attain an objective.
2. Tests are constructed for many purposes. Which of these is not a common
purpose of tests?
a. Affirmation.
b. Prediction.
c. Evaluation.
d. Description.
3. A test that is useful for one purpose is also likely to be effective for many other
purposes.
T F
4. Test items should be evaluated in terms of how well they match the test’s:
a. Purpose.
b. Reading level.
c. Reliability.
d. Assessment.
5. Explaining in one’s own words what a compromise entails is a task at what level of
Bloom’s taxonomy?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
7. Matching authors’ names and titles of their books is a task at what taxonomic
level?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
8. A knowledge-level understanding of reproduction is necessary but not sufficient
for a comprehension level of understanding of reproduction.
T F
10. Which of the following educational goals is not stated in behavioral terms?
a. Read.
b. Understand.
c. List.
d. Count.
13. A table of specifications is not useful when instructional objectives are available.
T F
14. A table of specifications would be appropriate for an achievement test but not for
a test that is used to predict future academic performance.
T F
SELECTED-RESPONSE ITEMS
Test items can be distinguished by the response required: the response is either
selected from two or more options or constructed by the test taker.
True-false items can be effective when a few guidelines are followed in their
construction:
6. True and false items should occur with the same frequency and be of the same
length.
Correct answers should be randomly ordered across the response-option positions. The
position of the correct response should not provide a clue about its correctness.
Complex options (e.g., all of the above; none of the above; or a and b, but not c) should
be used sparingly, if at all.
Matching items are usually presented in a two-column format: one column consists of
premises and the other consists of responses.
Matching items should contain homogeneous content so that all responses must be
considered plausible answers.
1. Teachers should be aware of the approximate number of items on the test that can be
guessed correctly.
2. Test items should be independent: The content of one item should not provide the
answers to others, nor should correctly answering one question be a prerequisite to
correctly answering another.
3. The reading level of the test should be lower than the grade level, unless reading is
being tested.
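The guessing concern in guideline 1 can be made concrete with a short sketch; the function name and figures are illustrative, assuming blind guessing on k-option items.

```python
# Expected number of items answered correctly by blind guessing on a
# selected-response test with k options per item (illustrative sketch).
def expected_chance_score(n_items, k):
    return n_items / k

fifty_item_mc = expected_chance_score(50, 4)   # 4-option multiple choice
twenty_item_tf = expected_chance_score(20, 2)  # true-false
```

On a 50-item four-option test, about 12.5 items can be expected correct by chance alone; on a 20-item true-false test, about 10.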
Review Items
1. Objective items are objective only in their:
a. Item content.
b. Scoring.
c. Distracters.
d. Wording.
2. Many selected-response items can be asked in each testing session; thus, they can
provide good:
a. Levels of difficulty.
b. Objectivity.
c. Content sampling.
d. Time sampling.
7. A matching item includes six events to be matched with nine responses consisting
of dates, cities, and states. The error in item construction is:
a. Too many premises.
b. Too few premises.
c. Responses contain heterogeneous content.
d. Responses contain homogeneous content.
9. The tendency for true-false items to measure trivia is a weakness of the item writer
more than of the item format:
T F
11. Increasing the length of a matching item tends to enhance the homogeneity of
content.
T F
12. Multiple-choice items generally require about the same response time per item as
matching items.
T F
13. The item format most appropriate for measuring knowledge of paired associates,
such as symbols and their meanings, is multiple-choice.
T F
14. The column of a matching item that contains the item stems is called the ________.
15. For usual classroom testing, the most desirable length for a matching item is
between ___________and___________premises.
CONSTRUCTED-RESPONSE ITEMS
For a short-answer item, the student supplies the answer to an item presented in
question, association, or completion form.
In constructing short-answer items, each item should have a unique, correct answer
and be structured so the student can clearly recognize its intent.
An essay item is one for which the student structures the response. He or she selects
ideas and then presents them according to his or her own organization and wording.
Essay items are used quite effectively to measure higher-level learning outcomes, such
as analysis, synthesis, and evaluation. Essay testing is not, however, an effective
means of measuring lower-level learning outcomes.
Essay items can be used to measure writing and self-expression skills. Although this
may not be the primary purpose of a given test, it is certainly worthwhile.
The student must be directed to the desired response. This can be enhanced by
identifying the intended student behaviors and including them in the essay item.
The suggested time for responding to each test item should be provided to the students.
This designates the weight or value of each item and also helps students budget their
time.
Review Items
1. In an association-form short-answer item, the spaces for the responses should:
a. Vary according to the length of the correct response.
b. All be the same size.
c. Vary in size, but not according to any order.
d. Vary in size according to some system of ordering.
6. The test scorer reads the response to one essay item after already reading several
other responses to the same item. The score of this response will tend to be:
a. Higher, if the earlier responses were of poor quality.
b. Higher, if the earlier responses were of high quality.
c. Lower, if the earlier responses were of poor quality.
d. Unaffected by the quality of earlier responses.
7. The halo effect in scoring essay items is the tendency to score more highly those
responses:
a. Read later in the scoring process.
b. Read earlier in the scoring process.
c. Of students known to be good students.
d. That are technically well written.
8. A student receives a high score on an essay item, due, in part, to the quality of
responses to the item read earlier. This is:
a. A context effect.
b. A halo effect.
c. A reader-agreement effect.
d. None of the above.
10. From a measurement standpoint, using a classroom test consisting entirely of essay
items is undesirable because:
a. Content sampling tends to be limited.
b. Scoring requires too much time.
c. It is difficult to construct the items.
d. Structuring model responses is too time consuming.
11. Short-answer items are generally easier to construct than matching items.
T F
12. The use of essay items is an effective means of measuring lower-level learning
outcomes.
T F
13. Analytic scoring of essay items tends to be faster than holistic scoring.
T F
14. Analytic scoring of essay items tends to be more objective than holistic scoring.
T F
15. There is a tendency to score longer responses to essay items more highly than
shorter responses.
T F
Representativeness of the norm group depends on the size of the sample and the
sampling method. The latter has numerous factors associated with it and is the more
likely source of biased norms.
Relevance depends on the degree to which the norm group is comparable to the group
under consideration.
National, local, and subgroup norms provide different perspectives for interpreting the
results of tests.
Norms are measures of the actual performance of a group on a test. They are not meant
to be standards of what performance levels should be.
Descriptive statistics are used to summarize characteristics of sets of test scores. The
level of statistics commonly used in measurement is quite basic, requiring only simple
arithmetic operations.
Frequency distributions summarize sets of test scores by listing the number of people
who received each test score. All of the test scores can be listed separately, or the
scores can be grouped in a frequency distribution.
The mean, the median, and the mode all describe central tendency:
Descriptive statistics that indicate dispersion are the range, the variance, and the
standard deviation. The range is the difference between the highest and lowest scores
in the distribution plus one. The standard deviation is a unit of measurement that shows
by how much the separate scores tend to differ from the mean. The variance is the
square of the standard deviation. Most scores are within two standard deviations of
the mean.
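The statistics described above can be computed with Python's standard statistics module; the score set below is hypothetical, and the range follows the text's inclusive definition (high minus low plus one).

```python
from statistics import mean, median, mode, pstdev, pvariance

scores = [70, 75, 75, 80, 85, 90, 95]   # hypothetical test scores

# Central tendency.
score_mean = mean(scores)      # arithmetic average
score_median = median(scores)  # middle score: 80
score_mode = mode(scores)      # most frequent score: 75

# Dispersion, using the text's inclusive range: high - low + 1.
score_range = max(scores) - min(scores) + 1   # 95 - 70 + 1 = 26
sd = pstdev(scores)                           # standard deviation
variance = pvariance(scores)                  # variance is the square of the SD
assert abs(variance - sd ** 2) < 1e-9
```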
Review Items
1. When using a norm-referenced interpretation, the student’s score on a test is
compared to:
a. A minimum score for passing the test.
b. The score of others taking the test.
c. As expected score based on the student’s ability.
d. A predetermined percentage of correct responses.
2. It is not important that the norm group for a nationally used achievement test:
a. Is large.
b. Is representative.
c. Is from at least three grade levels.
d. Has persons from all states.
3. The extent to which a norm group is comparable to the group being tested
determines the norm group’s
a. Relevance.
b. Representativeness.
c. Recency.
d. Reliability.
4. The “Lake Wobegon Phenomenon” in testing is the situation of:
a. Norm groups scoring unusually high on standardized tests.
b. Students scoring below the national average on standardized tests.
c. Students scoring above average on locally normed tests.
d. All states reporting above-average performance on nationally normed tests.
5. Norms for published standardized tests are commonly based on the performance
of:
a. Individual students who will perform well.
b. One or more groups of students.
c. Students in a typical school system.
d. A random sample of students from one state.
6. Local achievement and aptitude norms might be more important than national
norms in decisions about:
a. Future occupation.
b. The likelihood of success in certain colleges.
c. Selection into special high-school programs.
d. Allocations among different school districts.
7. The central administration of a school district sets a goal of having all elementary-
school students reading at or above the average on a nationally normed test. The
mistake being made is:
a. Making the assumption that the norm group is relevant to the local school.
b. Using the norm as a standard.
c. Attempting to have consistent reading performance in all schools.
d. Establishing too modest a goal.
9. When a distribution has a small number of scores, some of which are very extreme,
the preferred measure of central tendency is the:
a. Median
b. Mean
c. Range
d. Mode
10. A measure of dispersion for a distribution, whose computation involves only the
extreme scores, is:
a. Standard deviation
b. Variance
c. Mode
d. Range
11. Which of the following provides a measure of dispersion in the same units as the
original scores?
a. Variance
b. Median
c. Standard deviation
d. Correlation
12. Measures of central tendency are to location as measures of dispersion are to:
a. Points
b. Spread
c. Average
d. Frequencies
13. If the mode and median of a distribution of scores are equal, the mean will also
have to be equal to the median.
T F
14. When establishing national norms, size of the norm group is a major concern.
T F
15. Generally, the larger the numerical value of the median, the larger the value of the
standard deviation.
T F
COMPARING SCORES TO NORM GROUPS
When comparing an individual’s score to the scores of the norm group, the point is to
determine where the individual’s score falls in the norm-group distribution.
Percentiles indicate the percentage of students in the norm group who are at or below a
particular score.
The standard normal distribution has a mean of 0 and a standard deviation of 1.0. The
table in Appendix 4 gives the area from the mean to the z-score, expressed as a
proportion of the total area.
Standard scores and transformed standard scores express the relative position of a
score in a distribution in terms of standard-deviation units from the mean.
Stanines provide equal units of measurement. There are nine stanine scores and the
name comes from “standard nine.” Each stanine contains a band of scores, each band
equal to one-half standard deviation in width.
The NCE score is a normalized standard score with a mean of 50 and a standard
deviation of 21.06. Scores range from 1 through 99, and an equal unit is retained in the
scale.
Grade equivalent scores are intended to indicate the average level of performance for
students in each month of each grade. Unfortunately, grade equivalents do not form an
equal-interval scale.
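The score conversions described above can be sketched in Python; the raw score, norm-group mean, and SD below are illustrative, and the stanine banding follows the half-SD bands described above.

```python
# Converting a raw score to several norm-referenced scales, given the
# norm group's mean and SD (all figures here are illustrative).
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z       # T-score: mean 50, SD 10

def nce_score(z):
    return 50 + 21.06 * z    # NCE: mean 50, SD 21.06

def stanine(z):
    # Nine bands, each one-half SD wide; stanine 5 spans z of -0.25 to +0.25.
    return max(1, min(9, int(round(2 * z + 5))))

z = z_score(65, mean=50, sd=10)   # z = 1.5, so T = 65 and stanine = 8
```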
Key Terms and Concepts
Percentile
Percentile rank
Standard score
Standard normal distribution
Transformed standard score
Stanines
Normalized T-score
Normal curve equivalent score
Grade equivalent score
Review Items
1. A student who scores at the 45th percentile on a test:
a. Answered 45 percent of the items correctly.
b. Is above average in performance.
c. Equaled or surpassed 45 percent of the other examinees.
d. Had at least 45 percent of the right answers.
3. A test score that is at the 42nd percentile could also be said to be at which stanine?
a. 3rd
b. 4th
c. 5th
d. 6th
5. Which of these cannot be meaningfully averaged because the scores are ordinal
rather than interval?
a. Percentiles
b. Stanines
c. Standard scores
d. None of the above
10. In a normal distribution, which of the following indicates the highest relative position
in the distribution of scores?
a. Z = 1.5
b. Percentile rank = 90
c. T = 65
d. Stanine = 8
11. Joe has a stanine score of 7 on an exam. His performance is:
a. Below the 6th percentile
b. Between the 60th and 77th percentile
c. At the 50th percentile
d. Above the 80th percentile
13. Percentiles are more of an ordinal scale than an equal interval scale.
T F
14. When scores are converted to percentiles, a specified gain in achievement will result
in a larger increase in percentile rank if the gain is near the high end of the
distribution than near the middle of the distribution.
T F
ITEM STATISTICS FOR NORM-REFERENCED
TESTS
Constructing a perfect test is not likely, especially for the initial draft of the test, even
when we follow the guidelines for good test construction. Confusion, ambiguity, and
poorly constructed options may enter into an item. Students may perceive items
differently than the teacher intended. Item analysis provides empirical data about
how individual items perform in a real test situation. Item statistics do not reveal
the specific deficiencies in the content of items, but they indicate when an item is
deficient. Checking the item difficulty index and the discrimination index may give some
clues as to what is wrong. A careful inspection of the item content and the response
patterns of students is often quite revealing.
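A minimal item-analysis sketch along these lines (the 0/1 response matrix is hypothetical, and the upper-lower discrimination index is one common variant):

```python
# Each row is one student's 0/1 scores on the items of a short test
# (hypothetical data).
responses = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]

def difficulty(item):
    """p value: the proportion of students answering the item correctly."""
    col = [row[item] for row in responses]
    return sum(col) / len(col)

def discrimination(item):
    """Upper-lower index: p in the top half (ranked by total score)
    minus p in the bottom half."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper = sum(row[item] for row in ranked[:half]) / half
    lower = sum(row[item] for row in ranked[-half:]) / half
    return upper - lower
```

A positive index means high scorers tend to get the item right; a negative index flags an item pulling against the rest of the test.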
Review Items
1. To compute the correlation between attitude and achievement, one must have:
a. Achievement score from one group of people and attitude scores from another
group.
b. Achievement and attitude scores on the same group of people.
c. Achievement scores from two points in time and attitude scores from two points
in time.
d. The same tests given twice to the same group of people.
5. Students who score high on an ability measure were found to be able to solve a
learning task much faster than students scoring low on the ability measure. If scores
on the ability measure and time to complete the learning task are correlated, we
would expect:
a. Zero correlation
b. A zero coefficient of determination
c. Positive correlation
d. Negative correlation
9. If an item has a high discrimination index, it means that scores on the item have:
a. No correlation with total test scores
b. High correlation with total test scores
c. Low correlation with total test scores
d. Negative correlation with total test scores
10. An item has a negative discrimination index. Thus, if the student responds correctly
to this item, for this student we would expect a:
a. Low total test score
b. High total test score
c. Total test score around the middle
d. Total test score of zero
11. Of the following, which provides information about the distribution of total test
scores?
a. Difficulty index
b. Correlation coefficient
c. Discrimination index
d. Standard deviation
12. If we want to identify who is getting a test item correct, low scorers or high scorers,
we would check the difficulty index.
T F
13. An item has a discrimination index around .8. This means that high scorers on the
test are getting the item correct.
T F
14. An item has a difficulty index close to zero. This means that high scorers on the test
are getting the item correct.
T F
15. The ideal situation for a test is to have high difficulty levels and high discrimination
indices for the items.
T F
RELIABILITY OF NORM-REFERENCED TESTS
Reliability of measurement is consistency—consistency in measuring whatever the
instrument is measuring.
Test-retest, with the same test administered at two different times, provides the
estimate of stability reliability. The reliability coefficient is the correlation between the
scores of the two test administrations.
The split-half procedure divides the test into two parallel halves; the scores of the two
halves are then correlated. The reliability of the total test is then estimated using the
Spearman-Brown formula.
The KR-21 formula may be substituted for KR-20 if item difficulty levels are similar;
KR-21 is computationally easier, but it underestimates reliability if the items vary in
difficulty.
Test length affects reliability in such a way that the longer the test, the greater the
reliability, assuming other factors remain constant.
The Spearman-Brown formula is used for estimating the reliability of a test of increased
length. It is applied when using the split-half procedure, since the total test is twice as
long as the individual halves.
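The Spearman-Brown estimate can be sketched as follows; the reliability values fed in are illustrative.

```python
# Spearman-Brown prophecy formula: the estimated reliability of a test
# lengthened by a factor n, given its current reliability r.
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

# Split-half use: the half-test correlation stands in for r and n = 2,
# since the whole test is twice as long as either half.
whole_test = spearman_brown(0.60, 2)   # 1.2 / 1.6 = 0.75
doubled_40 = spearman_brown(0.70, 2)   # about 0.82 (compare review item 6)
```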
Difference scores tend to be less reliable than scores on individual tests. As the
correlation between the scores on the two tests increases, the reliability of the difference
scores decreases.
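One classical formula for the reliability of difference scores is consistent with the claim above; the formula is standard but not quoted in the text, and the coefficients below are illustrative.

```python
# Classical reliability-of-difference-scores formula:
# r_dd = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy),
# where r_xx and r_yy are the two tests' reliabilities and r_xy their correlation.
def diff_score_reliability(r_xx, r_yy, r_xy):
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Holding both reliabilities at .90, a higher pretest-posttest correlation
# yields a lower reliability for the difference scores.
low_corr = diff_score_reliability(0.90, 0.90, 0.50)    # about 0.80
high_corr = diff_score_reliability(0.90, 0.90, 0.80)   # about 0.50
```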
An observed test score may be considered as consisting of two parts: the true
component and the error component.
The standard error of measurement is the standard deviation of the distribution of error
scores. As reliability increases, the standard error of measurement decreases.
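A standard formula relating the two quantities is SEM = SD times the square root of one minus the reliability; it is consistent with review item 5 below. A sketch:

```python
import math

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# With an observed SD of 20 and a reliability of .84 (review item 5):
sem = standard_error_of_measurement(20, 0.84)   # about 8.00
```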
We can use the concepts of reliability and standard error of measurement in making
inferences about how an individual’s score would fluctuate on repeated use of the same
test. The distribution of scores would have a mean approaching the individual’s true
score and a standard deviation equal to the standard error of measurement.
Item similarity enhances reliability, and item difficulty affects reliability such that items of
moderate difficulty, around 50 percent correct responses per item, enhance reliability.
Review Items
1. The reliability coefficient can take on values:
a. From 0 to +1.00, inclusive
b. From – 1.00 to + 1.00, inclusive
c. Of any positive number
d. From – 1.00 to 0, inclusive
2. If a group of students was measured in September using a mathematics
achievement test and then tested again in October using the same test, the
correlation coefficient between the scores of the two test administrations would be
a measure of:
a. Stability reliability
b. Equivalent reliability
c. Internal consistency reliability
d. Both stability and equivalence reliability
5. On a given test, the observed standard deviation of the scores is 20, and the
reliability of the test is .84. The standard error of measurement is:
a. 8.00
b. 18.33
c. 3.20
d. 16.80
6. A test of 40 items has a reliability of .70. If the test is increased to 80 items, the
reliability will be:
a. 0.99
b. 0.54
c. 0.82
d. 0.90
10. In applying the split-half procedure for estimating reliability, the reliability coefficient
for one-half the test is computed. To estimate the reliability of the entire test, we use
the:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Spearman-Brown formula
d. Cronbach alpha procedure
12. A test of 100 items is divided into five subtests of 20 items each. If we are interested
in internal consistency reliability, the most appropriate procedure for estimating
reliability is:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Cronbach alpha
d. Parallel forms
13. A mathematics test is given to a class of gifted students and also to a regular
ungrouped class. The reliability of the test would likely:
a. Greater for the gifted class
b. Greater for the ungrouped class
c. About the same for both classes
d. Unable to infer anything until the reliability coefficient is computed
16. Theoretically, with respect to variance, reliability can be considered the ratio of:
a. Observed variance to true variance
b. Error variance to observed variance
c. True variance to error variance
d. True variance to observed variance
17. Conceptually, the true component and the error component of a test score are such
that:
a. The greater the true component, the greater the error component
b. The greater the true component, the smaller the error component
c. The components are equal
d. The components are independent
18. In conceptualizing the distributions of observed, true, and error scores, the following
is true for the means:
a. The observed mean equals the true mean
b. The error mean equals zero
c. The observed mean equals the true mean plus the error mean
d. All of the above
19. Conceptually, the variances of the distributions of the observed, true, and error
score are such that:
a. The variance of the error scores is zero
b. The error variance plus the true variance equals the observed variance
c. The observed variance is less than the true variance
d. The observed variance and the true variance are equal
20. A difference score is generated by subtracting a pretest score from a posttest score.
In order to obtain a high reliability for the difference score, we require:
a. Low correlation between pretest and posttest scores
b. High reliability for both pretest and posttest scores
c. Both a and b
d. Neither a nor b
VALIDITY OF NORM-REFERENCED TESTS
Validity is the extent to which a test measures what it is intended to measure.
Content validity is concerned with the extent to which the test is representative of a
defined body of content consisting of topics and processes.
Content validity is based on a logical analysis. It does not generate a validity coefficient,
as is obtained with some other types of validity.
Standardized achievement tests tend to have broad content coverage so they will have
wide application. However, when used in a specific situation, the content validity of a
prospective test should always be considered.
Criterion validity is based on the correlation between scores on the test and scores on a
criterion. The correlation coefficient is the criterion validity coefficient.
Concurrent validity is involved if the scores on the criterion are obtained at the same
time as the test scores. Predictive validity is involved if the scores on the criterion are
obtained after an intervening period from those of the test.
Concurrent validity applies if it is desirable to substitute a shorter test for a longer one.
In that case, the score on the longer test is the criterion, and validity is that of the
shorter test.
The construct validity of a measure or test is the extent to which scores can be
interpreted in terms of specified traits or constructs.
For tests validated through correlation with a criterion measure, validity can be expressed
as the proportion of the observed test variance that is common variance with the
criterion. The validity coefficient is the square root of this proportion.
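A tiny numeric sketch of this relationship (the coefficient value is chosen purely for illustration):

```python
import math

# A validity coefficient of .60 means the proportion of observed test
# variance shared with the criterion is its square, .36, and the
# coefficient is recovered as the square root of that proportion.
validity_coefficient = 0.60
common_variance_proportion = validity_coefficient ** 2   # 0.36
assert math.isclose(math.sqrt(common_variance_proportion), validity_coefficient)
```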
A well-constructed test with items of proper difficulty level will enhance validity. Validity
tends to increase with test length. Low item intercorrelation may tend to enhance
criterion validity if we have a complex criterion.
Increased heterogeneity of the group measured tends to enhance validity. Subtle,
individual factors may also affect validity. Tests should be properly administered, since
any procedures that impede performance also lower validity.
Review Items
1. Which of the following types of validity does not yield a validity coefficient?
a. Predictive
b. Concurrent
c. Content
d. Criterion
2. When considering the terms reliability and validity, as applied to a test, we can say:
a. A valid test ensures some degree of reliability
b. A reliable test ensures some degree of validity
c. Both a and b
d. Neither a nor b
4. If scores on a high-school academic aptitude test are highly and positively related to
freshman GPA in college, the test has:
a. Construct validity
b. Concurrent validity
c. Predictive validity
d. Content validity
5. Two forms of an essay test in American history were developed and administered to
a group of 35 students. The correlation between the scores on the two forms is a
measure of:
a. Predictive validity of the test
b. Content validity of the test
c. Test reliability
d. Reader agreement
6. The index of criterion validity for a test is computed by finding the correlation
between scores on:
a. The test and those on an external variable
b. Two forms of the test
c. Two halves of the test, such as scores on even-numbered and odd-numbered
items.
d. Two administrations of the test
10. The testing division of a school system is attempting to analyze the traits that are
inherent in the six subscores of an academic achievement test. The validity of
concern here is:
a. Concurrent
b. Content
c. Construct
d. Predictive
13. A factor analysis is conducted on the scores from six different IQ tests. One of the
factors has a large loading on a single IQ test and very small loadings on the
other five tests. This is a:
a. General factor
b. Specific factor
c. Group factor
d. None of the above
14. If a validity coefficient is computed for a test, and the test has been used with a very
homogeneous group of students, we expect that the validity coefficient will be:
a. Moderate, around .55
b. High
c. Low
d. Unable to make an inference
16. A test is found to have high reliability but low validity. In order for this to occur, the
test has:
a. Little true variance
b. Large error variance
c. Large specific variance
d. Little observed variance
17. When using criterion measures for establishing validity, a validity coefficient is
computed. Theoretically, in terms of variance, the validity coefficient is the square
root of the ratio of:
a. Variance common with the criterion to observed variance
b. Observed variance to variance common with the criterion
c. True variance in the criterion to observed variance
d. True variance in the criterion to true variance in the test being validated
19. Predictive validity of a test is increased as the groups tested become more
homogeneous.
T F
20. Construct validity refers to the adequacy of item construction for a test.
T F
CRITERION-REFERENCED TESTS
When the test items are not representative of a well-specified domain, we cannot
generalize our results beyond the specific items on the test.
Item forms contain enough detail about how the items should be constructed so that
they represent a well-specified domain.
Teachers can construct criterion-referenced tests through the use of objectives and item
specifications.
Review Items
1. Item forms are seldom used by classroom teachers because the forms:
a. Lack validity
b. Are complex and unwieldy
c. Require extensive pilot testing
d. Are appropriate only for standardized tests
2. Item forms refer to:
a. Item-writing rules
b. Response types (e.g., true-false, multiple-choice)
c. Parallel forms of items for reliability
d. Patterns of responses to sets of items
9. The method of setting standards that is most likely to be used in classroom settings
is the:
a. Professional judgment method
b. Nedelsky method
c. Angoff method
d. Contrasting groups method
10. A panel of qualified experts is not used in which of the following methods of setting
standards?
a. Professional judgment
b. Nedelsky
c. Angoff
d. Contrasting groups method
12. Critics of criterion-referenced tests are correct when they characterize standard-
setting procedures as:
a. Vague
b. Subjective
c. Inconsistent
d. Sophisticated
13. Tests that are linked to brief, specific instructional objectives are most likely good
examples of criterion-referenced tests.
T F
14. The item formats (e.g., multiple-choice, essay, etc.) should be different for norm-
referenced tests than for criterion-referenced tests.
T F
15. The “panel of experts” that is used in some standard-setting methods in school
settings consists of:
a. Academically talented students
b. Classroom teachers
c. Parent volunteers
d. Students from higher grades
ITEM STATISTICS FOR CRITERION-
REFERENCED TESTS
A test score is determined by the student’s performance on each of the items on
the test. In order to understand the test score, it is essential that we understand how
each item contributes to that score. The quality of the test depends on the quality of the
items that compose it. The procedures described in this chapter are ways to examine
the quality of the test items.
Items should be subjected to a content review before the test is given. Experts and
colleagues can help us by reviewing the test items for their match with the domain
specifications or the objectives, for any potentially biased wording, and for any
observable flaws in the item’s construction.
After the tests have been administered and scored, there should be a review of the
kinds of errors that were made so that remediation can focus on these errors. Statistical
analysis should be done so that we have evidence about the difficulty levels of the items
and about the degree to which the items are discriminating between masters and non-
masters or between students before and after instruction.
The difficulty levels of test items often turn out to be quite different from what the
teacher expected. Difficulty levels are a clear measure of how the students performed on
a specific task—the test item. As such, they provide very useful information to the
teacher.
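The difficulty calculation described above can be sketched in a few lines; the function name and the sample data are illustrative, not from the text. For dichotomously scored items (1 = correct, 0 = incorrect), the difficulty index is simply the proportion of examinees who answered correctly:

```python
def difficulty_index(responses):
    """Difficulty index p: proportion of examinees answering the item
    correctly. Counterintuitively, a higher p means an easier item."""
    return sum(responses) / len(responses)

# Hypothetical data: 30 students tested, 20 answer the item correctly.
item4 = [1] * 20 + [0] * 10
print(round(difficulty_index(item4), 2))  # 0.67
```

This matches review item 7 below: 20 correct out of 30 gives a difficulty index of about .67.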
The discrimination index provides information that is directly related to the purpose of
the test. A discrimination index can be seen as an analogy to the sport of rowing. If all of
the items on a test are likened to the crew members, we see that things work best when
they are all pulling together. This is the case when all of the discrimination indexes are
positive. If one of the crew lifts his or her oars out of the water and does nothing, it is
like an item with a zero discrimination index. A negative discrimination index would be
the situation of a crew member (item) rowing in the opposite direction from the rest of
the crew. Clearly, this latter case requires corrective action.
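One common form of the discrimination index for criterion-referenced tests, consistent with the masters/non-masters comparison mentioned earlier, is the difference in proportion correct between the two groups. This sketch assumes that form and uses hypothetical response data:

```python
def discrimination_index(masters, non_masters):
    """Discrimination: proportion correct among masters minus proportion
    correct among non-masters (1 = correct, 0 = incorrect).
    Positive = the item 'pulls with the crew'; zero = contributes nothing;
    negative = the item works against the rest of the test."""
    p_masters = sum(masters) / len(masters)
    p_non_masters = sum(non_masters) / len(non_masters)
    return p_masters - p_non_masters

# Hypothetical responses to one item.
masters = [1, 1, 1, 0, 1]      # 4 of 5 masters correct (.80)
non_masters = [0, 1, 0, 0, 0]  # 1 of 5 non-masters correct (.20)
print(round(discrimination_index(masters, non_masters), 2))  # 0.6
```

The same function could compare the same students before and after instruction, the other comparison the summary mentions.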
We cannot merely assume that we create high-quality test items. We need to subject
those items to item analysis in order to convince ourselves and others of their quality;
the time is well spent. The information that is provided will help us to understand
the quality of our tests so that we can base decisions on those test scores with
confidence.
Key Terms and Concepts
Review Items
1. An item with a p value near 1.0 is quite:
a. Easy
b. Difficult
c. Discriminating
d. Reliable
3. Panels with diverse backgrounds can be used to examine test items for:
a. Item bias
b. Difficulty
c. Discrimination
d. Continuity
7. If 30 students are tested and 20 answer item 4 correctly, the difficulty index for
item 4 would be:
a. .10
b. -.10
c. .33
d. .67
8. The item statistic that would indicate the most serious concern would be:
a. Difficulty equal to .85
b. Difficulty equal to .05
c. Discrimination equal to -.50
d. Discrimination equal to .00
9. Items that match a well-specified domain should have difficulty levels that:
a. Are exactly equal
b. Are very similar
c. Range from 0 to 1
d. Match the domain specification
10. The higher the value of the difficulty index, the:
a. Easier the item
b. More discriminating the item
c. Lower the percentage correct on the item
d. More biased the item
-12-
RELIABILITY OF CRITERION-REFERENCED
TESTS
A test is reliable if it provides consistent information about examinees. This can mean
that a criterion-referenced test provides consistent estimates of performance on a
domain or that the test provides consistent placement of an examinee in a mastery or
non-mastery category. Different kinds of reliability evidence are needed for each of
these uses of criterion-referenced tests.
Whether a test is consistent relative to mastery decisions is shown by giving the test on
two occasions to the same group of examinees and finding the percentage of
examinees whose mastery/ non-mastery classifications were both the same on the two
test occasions. This procedure could also be used when a parallel form of the test is
given on the second testing. A reliable test would have a high percentage of examinees
with the same mastery/ non-mastery classification on the two tests.
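The agreement percentage described above is straightforward to compute. In this sketch the function name, cut-off, and scores are hypothetical:

```python
def classification_consistency(scores1, scores2, cutoff):
    """Percentage of examinees who receive the same mastery/non-mastery
    classification on two administrations of the test (or on a parallel
    form given on the second occasion)."""
    same = sum(
        (a >= cutoff) == (b >= cutoff)   # classified the same way both times?
        for a, b in zip(scores1, scores2)
    )
    return 100 * same / len(scores1)

# Hypothetical scores for 5 examinees on two occasions, cut-off = 70.
first = [80, 65, 90, 50, 72]
second = [78, 71, 88, 55, 69]
print(classification_consistency(first, second, 70))  # 60.0
```

A reliable test would yield a high percentage; the 60% here would signal a consistency problem.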
When performance on a domain is to be estimated from the test scores, the standard
error of measurement can be used to form an interval estimate. An interval estimate
suggests the degree of imprecision that is in our test scores. The standard error of
measurement gives us an idea about how much we can expect test scores to fluctuate
across repeated testing.
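The interval estimate can be illustrated with the standard formula for the standard error of measurement, SEM = SD × sqrt(1 − reliability); the numbers below are hypothetical:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(score, sd, reliability):
    """Rough 68% interval estimate: observed score plus/minus one SEM."""
    error = sem(sd, reliability)
    return score - error, score + error

# Hypothetical test: sd = 10, reliability = .91, so SEM = 3 score points.
low, high = score_band(75, 10, 0.91)
print(round(low, 1), round(high, 1))  # 72.0 78.0
```

In words: for a student scoring 75 on this hypothetical test, we would expect scores on repeated testing to fall between about 72 and 78 roughly two times out of three.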
The reliability of the test can be increased by careful attention to the test items, the test
setting, and the examinees. A reliable test would have items that are
homogeneous. The more similar the items are, the more consistent will be students’
approach to those items. The items should be free of flaws or vagueness of wording so
that inconsistencies are reduced. And, because there is a direct relationship between
the length of the test and the reliability of the test, there should be a sufficient number of
items.
Inconsistencies in student performance can be reduced by making sure that the testing
conditions are appropriate. There should be adequate light and quiet so that the student
can concentrate on the task. Interruptions or distractions should be eliminated, and the
test items and directions about how to answer them should be clear.
Reliable scores depend on the students being motivated to apply themselves to the
task. This is promoted when the teacher encourages the students to do well and
explains how the test scores will be used. The teacher should be alert for individual
student problems such as fatigue or anxiety that might be affecting the reliability of the
test scores.
It is important that we pay attention to the reliability of our tests. With high-quality test
items, a well-controlled test setting, and highly motivated students, very good levels of
reliability can be obtained. However, this is not something that can be left to chance; it
requires conscientious effort.
Review Items
1. Which of these terms is most similar to test reliability?
a. Validity
b. Proficiency
c. Trustworthiness
d. Consistency
2. The same methods that are used for estimating the reliability of norm-referenced tests
are used for criterion-referenced tests.
T F
3. The concept of test reliability was developed soon after criterion-referenced testing
was introduced.
T F
4. If a criterion-referenced test is reliable, then the scores from that test are:
a. Useful
b. Standardized
c. Consistent
d. Valid
5. Which one of the following values computed for the reliability of a test would indicate
that the test is totally unreliable?
a. .10
b. .00
c. .50
d. 1.00
6. Exactly 100 students took a criterion-referenced test twice. The test had a mastery
cut-off score; 70 students were above the cut-off score on both tests and 15
students were below the cut-off score on both tests. The reliability of the test for
mastery decisions would be:
a. .40
b. .55
c. .70
d. .85
8. Other things being equal, the longer a test is, the____________will be its reliability.
a. Higher
b. Lower
c. Less ambiguous
d. More valid
9. The reliability coefficients that were developed for norm-referenced tests can also be
used effectively with criterion-referenced tests.
T F
10. The same criterion-referenced test was given to 30 children on consecutive days.
Of the children, 10 who surpassed the mastery cut-off score on the first day failed to
do so on the second day. The test could be said to be:
a. Unfair
b. Biased
c. Unreliable
d. Discriminating
12. Test reliability is primarily determined by the test itself. The test setting and the
examinee have a minimal impact on test reliability.
T F
14. When estimating a domain score, the reliability would increase if:
a. The items were more difficult
b. The items were somewhat dissimilar
c. The test had a cut-off score for mastery decisions
d. The test was longer
-13-
VALIDITY OF CRITERION-REFERENCED TESTS
A test that adequately serves the purpose for which it is used is considered to be a valid
test. Validity is always defined in terms of the purpose for which the test scores will be
used. Validity is a matter of degree. One test may be more valid than another but tests
are not usually totally lacking in validity and they are never perfectly valid.
Because criterion-referenced tests are used for several different purposes, including
estimating performance on a domain and determining whether students have achieved
mastery, it is not surprising that different kinds of logical and statistical evidence should
be presented to support the validity claims. The three kinds of test validity that were
introduced are content validity, criterion validity, and construct validity.
Content validity is a determination of the extent that the test items match the domain
specifications or objectives. Validity is established by having qualified persons, a panel
of experts, review the test items for appropriateness and congruence with the domain.
Criterion validity is concerned with whether the test would be an adequate predictor of
performance on some other variable. Validity evidence is established by finding the
correlation coefficient that links the test with the criterion that is to be predicted. The
choice between two competing tests would be based on which test has the higher
correlation with the criterion. When we are concerned about mastery decisions on two
measures, the degree of validity is shown by the percentage of persons for which the
mastery/ non-mastery decision is consistent.
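The correlation evidence described above is the ordinary Pearson correlation between test scores and the criterion. As an illustrative sketch (the data and function name are hypothetical):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between test scores and a criterion measure.
    Values near 1.0 indicate the test is a good predictor of the criterion."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical: four students' test scores vs. later criterion performance.
test_scores = [60, 70, 80, 90]
criterion = [55, 65, 72, 88]
print(round(pearson_r(test_scores, criterion), 2))  # 0.99
```

Given two competing tests, the one with the higher correlation with the criterion would be preferred, as the summary notes; for mastery decisions, the classification-consistency percentage from the reliability chapter applies in the same way across the two measures.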
Construct validity is shown by making predictions about the test scores and then
conducting analyses to see whether the predictions are confirmed. Some of the
reasonable predictions are: (1) the test scores should be positively correlated with other
measures of the same thing, (2) groups that are known to differ on the domain should
have test scores that are significantly different, and (3) we should not find different
patterns of responses across distracters for persons of different races, grades, or other
characteristics.
We cannot merely assume that our tests are valid. We need to conduct careful analyses to show
that our tests have sufficient content, criterion, or construct validity so that we can justify
the use of the tests.
Review Items
T F
2. Which of the following is not a type of validity that is described in the technical standards for
test publishers?
a. Criterion-referenced validity
b. Content validity
c. Construct validity
d. Criterion validity
3. Whether the items on a test match the domain of the criterion-referenced test is
primarily a concern about:
b. Item validity
c. Content validity
d. Criterion validity
4. Essentially the same processes are used to establish the validity of criterion-
referenced tests as are used with norm-referenced tests.
T F
a. Reliability
b. Content validity
c. Discrimination
d. Criterion validity
a. Content validity
b. Test sales
c. Construct validity
d. User validity
7. If students who surpass the mastery cut-off score for the addition of three-digit
numbers also tend to be those students who achieve mastery
a. Content validity
b. Criterion validity
c. Convergent validity
d. Mathematical validity
b. Mathematical validity
c. Content validity
d. Criterion validity
a. Content validity
b. Convergent validity
c. Construct validity
d. Criterion validity
10. If items on a criterion-referenced test do not match a well-defined domain, the test
lacks adequate:
a. Construct validity
b. Content validity
c. Criterion-referenced validity
d. Criterion validity
11. If two writers, working from the same test specifications, created test items that were
quite different from each other, the test would have inadequate:
a. Criterion validity
b. Item validity
c. Specification validity
d. Content validity
13. When a test does not achieve the purpose for which it was designed, the test lacks:
a. Validity
b. Reliability
c. Purposefulness
d. Discrimination
a. Construct validity
b. Criterion validity
c. Content validity
d. Criterion-referenced validity
-14-
FACTORS THAT AFFECT TEST SCORE
The probability of guessing the correct answer on a selected response question is 1/K,
where K is the number of options.
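Applied over a whole test, this rule gives the expected score from blind guessing: the number of items times 1/K. A quick sketch with hypothetical test lengths:

```python
def expected_guess_score(n_items, n_options):
    """Expected number correct from blind guessing:
    n_items * (1 / K), where K is the number of options per item."""
    return n_items / n_options

# 50-item test, five options per item: expected chance score is 10.
print(expected_guess_score(50, 5))  # 10.0
# 50-item true-false test (K = 2): expected chance score is 25.
print(expected_guess_score(50, 2))  # 25.0
```

This is why, for tests of equal length, a pure guesser scores highest on true-false items.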
Spelling and punctuation should not influence the score on an essay item unless the
student is informed beforehand that they will be considered in the scoring.
Characteristics of the student’s test-taking behaviors that can affect test scores include:
Guessing
Penmanship
Positional preference
Spelling
Changing answers
Bluffing
Anxiety
Reading difficulty
There are numerous factors, external to the knowledge and skills of the students, that can
affect test scores. Characteristics of the students and the testing situation can be
influential. It is important to control these factors and minimize their impact because as
they affect the test score, they detract from that score being a measure of the student’s
true score.
Student characteristics are to some extent controlled by the student (i.e., how one
handles the test situation and responds to the items). We discussed factors such as
test-taking skills programs and guessing, as well as mechanical factors such as
penmanship and spelling. We also addressed what is perhaps the greatest student
concern: test anxiety.
Test administration factors that are part of the testing situation were also described.
Often, these factors can be controlled by the teacher. Finally, we concluded that
students should be prepared to take tests as a basic part of their education, because many
have poor test-taking skills. Providing instruction in those skills is not an unethical
attempt to beat the system. Rather, it is an effort to make the test scores as accurate a
measure of performance as possible.
Key Terms and Concepts
Review Items
1. Programs for teaching test-taking skills tend to be:
a. Equally effective throughout grades 1-8
b. More effective with lower grades than upper elementary grades
c. More effective with upper grades than lower elementary grades
d. Of no effect throughout grades 1-8
2. A student who guesses on every test item will have the highest score on which kind
of test? (Assume the tests have equal length)
a. True-false
b. Multiple-choice (four-option)
c. Multiple-choice (five-option)
d. Fill-in-the-blank
4. What is the expected score of a student who guesses on all items of a 50 item
multiple-choice test with five options per item?
a. 0
b. 5
c. 10
d. 15
6. A limit is set on the length of response (in words) to an item. This is an attempt to
limit the effect of:
a. Guessing
b. Bluffing
c. Positional preference
d. Changing answer
7. Which of the following is least related to the others? The effect of:
a. Test anxiety
b. Bluffing
c. Penmanship and spelling
d. Positional preference
9. Students should be shown how to take tests so that the tests provide:
a. Enriched diagnostic information for the future informational planning
b. Information about the test setting, as well as the test content
c. Information about wrong answers, as well as right answer
d. A more accurate picture of what the student is able to do
11. Generally, more test answers are changed from wrong to right than right to wrong.
T F
12. Applying the correction for guessing raises the score on a test.
T F
13. Good penmanship and spelling tend to be positively correlated with the grades
assigned to essay responses.
T F
14. The physical arrangement of the testing situation is as important for enhancing
student performance as is establishing control and rapport.
T F
15. Separate answer sheets can be used effectively for students beginning with those in
second grade.
T F
-15-
THE USE OF STANDARDIZED ACHIEVEMENT
TESTS
Standardized achievement tests are widely used in our schools. There are many tests
on the market, available in a variety of forms, including norm-referenced and criterion-
referenced tests.
Test results can be reported for individual students, classrooms, or even school
buildings. In addition, local and national norms are available for many tests.
Teachers play an important role in standardized achievement testing. They need to take
this role seriously and make sure that the physical and psychological settings promote a
positive testing environment.
Despite their popular use, standardized achievement tests do have certain limitations.
Some of these deal with the time required for the test administration and the processing
of test scores. Other limitations concern the usefulness and accuracy of the reported
scores. Some people worry that the average performance has become a standard of
performance and that achievement tests may have too much influence over school
curricula.
High-quality standardized achievement tests are available; they do the job that they
were designed to do. However, when tests are used for other purposes, their
effectiveness will be limited. Therefore, those who select standardized achievement
tests must do a careful and complete job of comparing alternatives.
Review Items
1. What is standardized on a standardized achievement test?
a. The anticipated level of performance
b. The conditions for test administration
c. The test validity
d. The purpose for which the test is given
5. Which of the following factors should be the most important consideration when
selecting a standardized achievement test?
a. Reliability
b. Cost
c. Publisher’s reputation
d. Relevance
7. Most teachers find that the results of standardized achievement tests are helpful to
them when planning instruction for individual students.
T F
9. Achievement-at-grade-level is:
a. A meaningless term statistically
b. An expectation that we should have for all students
c. Average performance for students in that grade
d. An arbitrary assessment based on standardized achievement test scores
10. What is a form of testing in which the items that are presented to the student
depend on the student’s answers to previous items?
a. Non-standardized testing
b. Response-dependent testing trials
c. Step-by-step testing
d. Computerized adaptive testing
11. If the items on a standardized achievement test did not match what was taught in a
particular school, the test would lack:
a. Technical adequacy
b. Relevance
c. Utility
d. Reliability
12. The verb that is frequently used in Standards for Educational and Psychological
Testing and that shows the orientation of the Standards is:
a. Should
b. Must
c. Might
d. Shall
13. Some major standardized achievement tests interpret the same test performance in
both norm-referenced and criterion-referenced formats.
T F
14. Standardized achievement tests are among our best examples of high-quality tests in
education.
T F
15. Which of the following will probably occur within the next decade?
a. The use of standardized achievement tests will decline dramatically
b. Test publishers will provide either norm-referenced or criterion-referenced tests,
but not both
c. Substantial improvements will be made in the reliability and validity of
standardized achievement tests
d. Much of the testing will be done using computer terminals
-16-
TEST BIAS
Test bias is systematic error, usually evidenced by different performance on the test by
two or more groups or individuals.
The fact that one or more specific groups score poorly on a specific test does not
necessarily prove test bias. Bias occurs when test scores are consistently inaccurate
(either too high or too low) for an individual or group.
When checking for item bias, the focus is on response to individual items rather than the
total test score.
There is a distinction between a test being biased and using test results in a biased
manner. This difference is important in getting at the source of the problem.
The test user is responsible for selecting an appropriate and adequate test for the
specific purpose of testing. He or she must also ensure that the test is administered in
a consistent and fair manner.
Teachers and school officials must ensure that confidentiality is maintained in the use,
storage, and disposal of test scores and interpretations.
Review Items
1. Which situation is an example of test bias?
a. Third-grade boys consistently score lower than third-grade girls on a reading test
b. Women consistently score higher than men on a test used to predict success in a
specific profession
c. All of the above
d. None of the above
4. Members of groups, such as minority groups, are most likely to consider tests
biased if the results are used for:
a. Admission to remedial programs
b. Selection decisions
c. Assigning grades
d. Instructional evaluation
6. Litigation over the use of test results has most commonly focused on the use of:
a. Classroom tests for grading purposes
b. Aptitude tests for college admission
c. Achievement tests for school advancement
d. Psychological tests for screening and placement
7. The provision in truth-in-testing laws most troublesome for test publishers is the:
a. Need to do differential item functioning analysis
b. Requirement for sensitivity review
c. Availability of test questions and their correct answers
d. Disclosure of test validity studies
8. Access to test information (that is, the persons having a right to know test scores of
others) is addressed by:
a. Federal legislation through P. L. 94-142
b. State legislation
c. Federal legislation through the Dole and Baker Amendment
d. Federal legislation through the Buckley Amendment
11. When selected groups perform high or low on a test, we have conclusive evidence
of test bias.
T F
12. Test bias (or bias in the scores of a test) occurs if there is random error in the
measurement.
T F
13. Achievement tests in academic and skill areas have less likelihood of being biased than
intelligence tests or general ability tests.
T F
14. If a test continually presents women performing lower-status, service-type tasks, the
test is performance biased.
T F
15. The Larry P. v. Wilson Riles case and the PASE v. Hannon case were concerned with
violations of truth-in-testing laws.
T F
-17-
HIGH-STAKES TESTS
High-stakes tests are those that are perceived to have important consequences for the
examinees. Scoring well or poorly on these tests may determine whether one is
promoted, graduated, accepted into a particular college, or allowed into a certain
profession. When the test results are used for such important decisions, it is not
surprising that examinees feel intense pressure when preparing for the tests or that the
general public is interested in the tests and how well students do on them. Some people
believe that these controversial tests bring objectivity, fairness, and rigorous standards
into decisions about students. Others feel that the usefulness of these tests has been
grossly overstated and that they have important negative effects that outweigh their
usefulness. Our position is that it is important to understand the strengths and the
weaknesses of these tests so that informed judgments can be made about them.
Minimum competency tests are used in elementary and secondary schools as one basis
for decisions about promotion or graduation. Competency tests were meant to be a
safeguard against graduating students who lacked essential skills (i.e., students who
could not read or compute). Although such tests were targeted toward a small
percentage of low-achieving students, they have become a requirement for all students.
The controversy about competency tests concerns what content and skills
should be included on the test, what the minimal acceptable standard of performance
should be, and what should be done for those students who fail the test. Recent studies
suggest that the impact of these tests has been undermined by ‘safety nets’ such as the
ability to retake the test many times, tinkering with the cut-off score, and the possibility
of overruling the test results.
College admissions tests were also described as high-stakes tests because of how the
ACT and the SAT are used in admissions decisions. These tests have a long history
and their reputation has grown over the years. There is great pressure to do well on
these exams if one is applying to a selective college or university. Proponents of the
tests believe that the tests provide a common yardstick for evaluating students. However,
colleges may use the test scores in any way they choose, including not at all. Thus, the
importance of the test score in the college admission process is often overestimated.
Critics of admissions tests indicate that the scores add little to the prediction of
freshman grades beyond information that is available in the high school record. Crouse
and Trusheim (1988) stated that essentially the same freshman class will be admitted
whether the test scores are used or not.
Review Items
1. All minimum competency tests have cut-off scores.
T F
3. The pressure for minimum competency testing programs comes mainly from:
a. Students
b. Teachers
c. Legislators
d. Testing experts
4. The most difficult decision concerning minimum competency testing tends to be:
a. What standard of performance to expect
b. Which students to test
c. What content to include on the test
d. Who should be involved in planning the test
5. The ability to retake a minimum competency test, student exemptions from the test,
and lowering the performance standard by the standard error of measurement have
been called:
a. Practical policy adjustments
b. Failure failsafes
c. Test essements
d. Safety nets
6. The ACT and the SAT are evaluated primarily in terms of their:
a. Standard error of measurement
b. Content validity
c. Reliability
d. Criterion validity
8. Books about how to do well on the SAT or the ACT would most help:
a. Older students
b. Prospective English majors
c. Students who are intimidated by tests
d. Students who are unfamiliar with tests
11. The factor that distinguishes high-stakes tests from other tests is:
a. The types of test items
b. A mastery cut-off score
c. The consequences of doing well or poorly
d. The cost of taking the tests
12. Which term best describes high-stakes test such as the SAT or the NTE?
a. Invalid
b. Controversial
c. Subjective
d. Unfair
14. The implementation of minimum competency testing programs at the state level has
clearly increased the public’s confidence in the public schools.
T F