INTRODUCTION
National trends have largely been set for the 1990s. They indicate continued
emphasis on testing, at least as far as standardized tests are concerned, with expanded
uses of test results.
The local school continues to be the implementing agency for assessment programs,
typically those mandated at the state level but also influenced by national trends.
Measurements may differ in the amount of information the numbers contain. These
differences are distinguished by the terms nominal, ordinal, interval, and ratio scales of
measurement.
1. Nominal scales categorize only.
2. Ordinal scales categorize and order.
3. Interval scales categorize, order, and establish an equal unit in the scale.
4. Ratio scales categorize, order, establish an equal unit, and contain a true zero
point.
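As a concrete illustration of the four scales, here is a brief Python sketch; all of the data values are hypothetical.

```python
# Illustrative (hypothetical) data for the four scales of measurement.

# Nominal: numbers only categorize (1 = girl, 0 = boy); no order implied.
gender_codes = [0, 1, 1, 0, 1]

# Ordinal: numbers categorize and order (class ranks), but gaps are not equal units.
class_ranks = [1, 2, 3, 4, 5]

# Interval: equal units but no true zero point.
test_scores = [65, 70, 85, 90]
# Differences are meaningful: 70 - 65 equals 90 - 85 ...
assert test_scores[1] - test_scores[0] == test_scores[3] - test_scores[2]
# ... but ratios are not: a score of 80 is not "twice" a score of 40.

# Ratio: equal units and a true zero, so ratios are meaningful.
times_in_seconds = [30.0, 60.0]
assert times_in_seconds[1] / times_in_seconds[0] == 2.0  # twice as long
```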
Review Items
a. Predominately norm-referenced.
b. Predominately criterion-referenced.
2. When state legislators have mandated testing, such legislation has been focused
primarily on:
5. Of the terms below, the one having the narrowest meaning is:
a. Evaluation.
b. Assessment.
c. Measurement.
d. Test.
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
8. If, when reading information about a class, we assign a 1 to girls and a 0 to boys,
this level of measurement is:
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
9. The scores from a typical classroom test are probably better than
____________ measurement, but not quite ____________.
a. Ordinal, nominal.
b. Interval, ratio.
c. Nominal, ordinal.
d. Ordinal, interval.
10. When using a performance test, the difference between scores of 65 and 70 equals
the difference between scores of 85 and 90, but a score of 80 is not twice a score of
40. The level of measurement is:
a. Interval
b. Nominal.
c. Ordinal.
d. Ratio.
PLANNING THE TEST
Tests are given for many different reasons. In order to achieve such diverse purposes,
they need to be carefully planned. In classroom settings, this planning usually entails
instructional objectives and/or a table of specifications.
When teachers use specific instructional objectives, it becomes very clear what should
be on the test. The desired student behaviors from the objectives translate directly into
the items for the test.
The use of objectives or a table of specifications may imply strict guidelines for
constructing tests, but much of testing is determined by practical considerations. Test
constructors must consider such things as how much testing time is available, what item
formats the students can handle, and the developmental level of the examinees. A well-
planned test is no accident.
Review Items
1. The Bloom et al. taxonomy of educational objectives in the cognitive domain is a
hierarchy based on the:
a. Extent of recall required to attain an objective.
b. Level of understanding required to attain an objective.
c. Reading level required to attain an objective.
d. Aptitude required to attain an objective.
2. Tests are constructed for many purposes. Which of these is not a common
purpose of tests?
a. Affirmation.
b. Prediction.
c. Evaluation.
d. Description.
3. A test that is useful for one purpose is also likely to be effective for many other
purposes.
T F
4. Test items should be evaluated in terms of how well they match the test’s:
a. Purpose.
b. Reading level.
c. Reliability.
d. Assessment.
5. Explaining in one’s own words what a compromise entails is a task at what level of
Bloom’s taxonomy?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
7. Matching authors’ names and titles of their books is a task at what taxonomic
level?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
8. A knowledge-level understanding of reproduction is necessary but not sufficient
for a comprehension level of understanding of reproduction.
T F
10. Which of the following educational goals is not stated in behavioral terms?
a. Read.
b. Understand.
c. List.
d. Count.
13. A table of specifications is not useful when instructional objectives are available.
T F
14. A table of specifications would be appropriate for an achievement test but not for
a test that is used to predict future academic performance.
T F
SELECTED-RESPONSE ITEMS
Test items can be distinguished by the response required: the response is either
selected from two or more options or constructed by the test taker.
True-false items can be effective when a few guidelines are followed in their
construction:
6. True and false items should occur with the same frequency and be of the same
length.
Correct answers should be randomly ordered across the response-option positions. The
position of the correct response should not provide a clue about its correctness.
Complex options (e.g., all of the above; none of the above; or a and b, but not c) should
be used sparingly, if at all.
Matching items are usually presented in a two-column format: one column consists of
premises and the other consists of responses.
Matching items should contain homogeneous content so that all responses must be
considered plausible answers.
1. Teachers should be aware of the approximate number of items on the test that can be
guessed correctly.
2. Test items should be independent: The content of one item should not provide the
answers to others, nor should correctly answering one question be a prerequisite to
correctly answering another.
3. The reading level of the test should be lower than the grade level, unless reading is
being tested.
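The guessing concern in guideline 1 can be made concrete with a short sketch; the function name and figures are illustrative, assuming blind guessing on k-option items.

```python
# Expected number of items answered correctly by blind guessing on a
# selected-response test with k options per item (illustrative sketch).
def expected_chance_score(n_items, k):
    return n_items / k

fifty_item_mc = expected_chance_score(50, 4)   # 4-option multiple choice
twenty_item_tf = expected_chance_score(20, 2)  # true-false
```

On a 50-item four-option test, about 12.5 items can be expected correct by chance alone; on a 20-item true-false test, about 10.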
Review Items
1. Objective items are objective only in their:
a. Item content.
b. Scoring.
c. Distracters.
d. Wording.
2. Many selected-response items can be asked in each testing session; thus, they can
provide good:
a. Levels of difficulty.
b. Objectivity.
c. Content sampling.
d. Time sampling.
7. A matching item includes six events to be matched with nine responses consisting
of dates, cities, and states. The error in item construction is:
a. Too many premises.
b. Too few premises.
c. Responses contain heterogeneous content.
d. Responses contain homogeneous content.
9. The tendency for true-false items to measure trivia is a weakness of the item writer
more than of the item format:
T F
11. Increasing the length of a matching item tends to enhance the homogeneity of
content.
T F
12. Multiple-choice items generally require about the same response time per item as
matching items.
T F
13. The item format most appropriate for measuring knowledge of paired associates,
such as symbols and their meanings, is multiple-choice.
T F
14. The column of a matching item that contains the item stems is called the ________.
15. For usual classroom testing, the most desirable length for a matching item is
between ___________and___________premises.
CONSTRUCTED-RESPONSE ITEMS
For a short-answer item, the student supplies the answer to an item presented in
question, association, or completion form.
In constructing short-answer items, each item should have a unique, correct answer
and be structured so the student can clearly recognize its intent.
An essay item is one for which the student structures the response. He or she selects
ideas and then presents them according to his or her own organization and wording.
Essay items are used quite effectively to measure higher-level learning outcomes, such
as analysis, synthesis, and evaluation. Essay testing is not, however, an effective
means of measuring lower-level learning outcomes.
Essay items can be used to measure writing and self-expression skills. Although this
may not be the primary purpose of a given test, it is certainly worthwhile.
The student must be directed to the desired response. This can be enhanced by
identifying the intended student behaviors and including them in the essay item.
The suggested time for responding to each test item should be provided to the students.
This designates the weight or value of each item and also helps students budget their
time.
Review Items
1. In an association-form short-answer item, the spaces for the responses should:
a. Vary according to the length of the correct response.
b. All be the same size.
c. Vary in size, but not according to any order.
d. Vary in size according to some system of ordering.
6. The test scorer reads the response to one essay item after already reading several
other responses to the same item. The score of this response will tend to be:
a. Higher, if the earlier responses were of poor quality.
b. Higher, if the earlier responses were of high quality.
c. Lower, if the earlier responses were of poor quality.
d. Unaffected by the quality of earlier responses.
7. The halo effect in scoring essay items is the tendency to score more highly those
responses:
a. Read later in the scoring process.
b. Read earlier in the scoring process.
c. Of students known to be good students.
d. That are technically well written.
8. A student receives a high score on an essay item, due, in part, to the quality of
responses to the item read earlier. This is:
a. A context effect.
b. A halo effect.
c. A reader-agreement effect.
d. None of the above.
10. From a measurement standpoint, using a classroom test consisting entirely of essay
items is undesirable because:
a. Content sampling tends to be limited.
b. Scoring requires too much time.
c. It is difficult to construct the items.
d. Structuring model responses is too time consuming.
11. Short-answer items are generally easier to construct than matching items.
T F
12. The use of essay items is an effective means of measuring lower-level learning
outcomes.
T F
13. Analytic scoring of essay items tends to be faster than holistic scoring.
T F
14. Analytic scoring of essay items tends to be more objective than holistic scoring.
T F
15. There is a tendency to score longer responses to essay items more highly than
shorter responses.
T F
Representativeness of the norm group depends on the size of the sample and the
sampling method. The latter has numerous factors associated with it and is the more
likely source of biased norms.
Relevance depends on the degree to which the norm group is comparable to the group
under consideration.
National, local, and subgroup norms provide different perspectives for interpreting the
results of tests.
Norms are measures of the actual performance of a group on a test. They are not meant
to be standards of what performance levels should be.
Descriptive statistics are used to summarize characteristics of sets of test scores. The
level of statistics commonly used in measurement is quite basic, requiring only simple
arithmetic operations.
Frequency distributions summarize sets of test scores by listing the number of people
who received each test score. All of the test scores can be listed separately, or the
scores can be grouped in a frequency distribution.
The mean, the median, and the mode all describe central tendency:
Descriptive statistics that indicate dispersion are the range, the variance, and the
standard deviation. The range is the difference between the highest and lowest scores
in the distribution plus one. The standard deviation is a unit of measurement that shows
by how much the separate scores tend to differ from the mean. The variance is the
square of the standard deviation. Most scores are within two standard deviations of
the mean.
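The statistics described above can be computed with Python's standard statistics module; the score set below is hypothetical, and the range follows the text's inclusive definition (high minus low plus one).

```python
from statistics import mean, median, mode, pstdev, pvariance

scores = [70, 75, 75, 80, 85, 90, 95]   # hypothetical test scores

# Central tendency.
score_mean = mean(scores)      # arithmetic average
score_median = median(scores)  # middle score: 80
score_mode = mode(scores)      # most frequent score: 75

# Dispersion, using the text's inclusive range: high - low + 1.
score_range = max(scores) - min(scores) + 1   # 95 - 70 + 1 = 26
sd = pstdev(scores)                           # standard deviation
variance = pvariance(scores)                  # variance is the square of the SD
assert abs(variance - sd ** 2) < 1e-9
```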
Review Items
1. When using a norm-referenced interpretation, the student’s score on a test is
compared to:
a. A minimum score for passing the test.
b. The score of others taking the test.
c. As expected score based on the student’s ability.
d. A predetermined percentage of correct responses.
2. It is not important that the norm group for a nationally used achievement test:
a. Is large.
b. Is representative.
c. Is from at least three grade levels.
d. Has persons from all states.
3. The extent to which a norm group is comparable to the group being tested
determines the norm group’s
a. Relevance.
b. Representativeness.
c. Recency.
d. Reliability.
4. The “Lake Wobegon Phenomenon” in testing is the situation of:
a. Norm groups scoring unusually high on standardized tests.
b. Students scoring below the national average on standardized tests.
c. Students scoring above average on locally normed tests.
d. All states reporting above-average performance on nationally normed tests.
5. Norms for published standardized tests are commonly based on the performance
of:
a. Individual students who will perform well.
b. One or more groups of students.
c. Students in a typical school system.
d. A random sample of students from one state.
6. Local achievement and aptitude norms might be more important than national
norms in decisions about:
a. Future occupation.
b. The likelihood of success in certain colleges.
c. Selection into special high-school programs.
d. Allocations among different school districts.
7. The central administration of a school district sets a goal of having all elementary-
school students reading at or above the average on a nationally normed test. The
mistake being made is:
a. Making the assumption that the norm group is relevant to the local school.
b. Using the norm as a standard.
c. Attempting to have consistent reading performance in all schools.
d. Establishing too modest a goal.
9. When a distribution has a small number of scores, some of which are very extreme,
the preferred measure of central tendency is the:
a. Median
b. Mean
c. Range
d. Mode
10. A measure of dispersion for a distribution, whose computation involves only the
extreme scores, is:
a. Standard deviation
b. Variance
c. Mode
d. Range
11. Which of the following provides a measure of dispersion in the same units as the
original scores?
a. Variance
b. Median
c. Standard deviation
d. Correlation
12. Measures of central tendency are to location as measures of dispersion are to:
a. Points
b. Spread
c. Average
d. Frequencies
13. If the mode and median of a distribution of scores are equal, the mean will also
have to be equal to the median.
T F
14. When establishing national norms, size of the norm group is a major concern.
T F
15. Generally, the larger the numerical value of the median, the larger the value of the
standard deviation.
T F
COMPARING SCORES TO NORM GROUPS
When comparing an individual’s score to the scores of the norm group, the point is to
determine where the individual’s score falls in the norm-group distribution.
Percentiles indicate the percentage of students in the norm group who are at or below a
particular score.
The standard normal distribution has a mean of 0 and a standard deviation of 1.0. The
table in Appendix 4 gives the area from the mean to the z-score, expressed as a
proportion of the total area.
Standard scores and transformed standard scores express the relative position of a
score in a distribution in terms of standard-deviation units from the mean.
Stanines provide equal units of measurement. There are nine stanine scores and the
name comes from “standard nine.” Each stanine contains a band of scores, each band
equal to one-half standard deviation in width.
The NCE score is a normalized standard score with a mean of 50 and a standard
deviation of 21.06. Scores range from 1 through 99, and an equal unit is retained in the
scale.
Grade equivalent scores are intended to indicate the average level of performance for
students in each month of each grade. Unfortunately, grade equivalents do not form an
equal-interval scale.
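The score conversions described above can be sketched in Python; the raw score, norm-group mean, and SD below are illustrative, and the stanine banding follows the half-SD bands described above.

```python
# Converting a raw score to several norm-referenced scales, given the
# norm group's mean and SD (all figures here are illustrative).
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z       # T-score: mean 50, SD 10

def nce_score(z):
    return 50 + 21.06 * z    # NCE: mean 50, SD 21.06

def stanine(z):
    # Nine bands, each one-half SD wide; stanine 5 spans z of -0.25 to +0.25.
    return max(1, min(9, int(round(2 * z + 5))))

z = z_score(65, mean=50, sd=10)   # z = 1.5, so T = 65 and stanine = 8
```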
Key Terms and Concepts
Percentile
Percentile rank
Standard score
Standard normal distribution
Transformed standard score
Stanines
Normalized T-score
Normal curve equivalent score
Grade equivalent score
Review Items
1. A student who scores at the 45th percentile on a test:
a. Answered 45 percent of the items correctly.
b. Is above average in performance.
c. Equaled or surpassed 45 percent of the other examinees.
d. Had at least 45 percent of the right answers.
3. A test score that is at the 42nd percentile could also be said to be at which stanine?
a. 3rd
b. 4th
c. 5th
d. 6th
5. Which of these cannot be meaningfully averaged because the scores are ordinal
rather than interval?
a. Percentiles
b. Stanines
c. Standard scores
d. None of the above
10. In a normal distribution, which of the following indicates the highest relative position
in the distribution of scores?
a. Z = 1.5
b. Percentile rank = 90
c. T = 65
d. Stanine = 8
11. Joe has a stanine score of 7 on an exam. His performance is:
a. Below the 6th percentile
b. Between the 60th and 77th percentile
c. At the 50th percentile
d. Above the 80th percentile
13. Percentiles are more of an ordinal scale than an equal interval scale.
T F
14. When scores are converted to percentiles, a specified gain in achievement will result
in a larger increase in percentile rank if the gain is near the high end of the
distribution than near the middle of the distribution.
T F
ITEM STATISTICS FOR NORM-REFERENCED
TESTS
Constructing a perfect test is not likely, especially for the initial draft of the test, even
when we follow the guidelines for good test construction. Confusion, ambiguity, and
poorly constructed options may enter into an item. Students may perceive items
differently than the teacher intended. Item analysis provides empirical data about
how individual items perform in a real test situation. Item statistics do not reveal
the specific deficiencies in the content of items, but they indicate when an item is
deficient. Checking the item difficulty index and the discrimination index may give some
clues as to what is wrong. A careful inspection of the item content and the response
patterns of students is often quite revealing.
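A minimal item-analysis sketch along these lines (the 0/1 response matrix is hypothetical, and the upper-lower discrimination index is one common variant):

```python
# Each row is one student's 0/1 scores on the items of a short test
# (hypothetical data).
responses = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]

def difficulty(item):
    """p value: the proportion of students answering the item correctly."""
    col = [row[item] for row in responses]
    return sum(col) / len(col)

def discrimination(item):
    """Upper-lower index: p in the top half (ranked by total score)
    minus p in the bottom half."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper = sum(row[item] for row in ranked[:half]) / half
    lower = sum(row[item] for row in ranked[-half:]) / half
    return upper - lower
```

A positive index means high scorers tend to get the item right; a negative index flags an item pulling against the rest of the test.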
Review Items
1. To compute the correlation between attitude and achievement, one must have:
a. Achievement score from one group of people and attitude scores from another
group.
b. Achievement and attitude scores on the same group of people.
c. Achievement scores from two points in time and attitude scores from two points
in time.
d. The same tests given twice to the same group of people.
5. Students who score high on an ability measure were found to be able to solve a
learning task much faster than students scoring low on the ability measure. If scores
on the ability measure and time to complete the learning task are correlated, we
would expect:
a. Zero correlation
b. A zero coefficient of determination
c. Positive correlation
d. Negative correlation
9. If an item has a high discrimination index, it means that scores on the item have:
a. No correlation with total test scores
b. High correlation with total test scores
c. Low correlation with total test scores
d. Negative correlation with total test scores
10. An item has a negative discrimination index. Thus, if the student responds correctly
to this item, for this student we would expect a:
a. Low total test score
b. High total test score
c. Total test score around the middle
d. Total test score of zero
11. Of the following, which provides information about the distribution of total test
scores?
a. Difficulty index
b. Correlation coefficient
c. Discrimination index
d. Standard deviation
12. If we want to identify who is getting a test item correct, low scorers or high scorers,
we would check the difficulty index.
T F
13. An item has a discrimination index around .8. This means that high scorers on the
test are getting the item correct.
T F
14. An item has a difficulty index close to zero. This means that high scorers on the test
are getting the item correct.
T F
15. The ideal situation for a test is to have high difficulty levels and high discrimination
indices for the items.
T F
RELIABILITY OF NORM-REFERENCED TESTS
Reliability of measurement is consistency—consistency in measuring whatever the
instrument is measuring.
Test-retest, with the same test administered at two different times, provides the
estimate of stability reliability. The reliability coefficient is the correlation between the
scores of the two test administrations.
The split-half procedure divides the test into two parallel halves; the scores of the two
halves are then correlated. The reliability of the total test is then estimated using the
Spearman-Brown formula.
The KR-21 formula may be substituted for KR-20 if item difficulty levels are similar;
KR-21 is computationally easier, but it underestimates reliability if the items vary in
difficulty.
Test length affects reliability in such a way that the longer the test, the greater the
reliability, assuming other factors remain constant.
The Spearman-Brown formula is used for estimating the reliability of a test of increased
length. It is applied when using the split-half procedure, since the total test is twice as
long as the individual halves.
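The Spearman-Brown estimate can be sketched as follows; the reliability values fed in are illustrative.

```python
# Spearman-Brown prophecy formula: the estimated reliability of a test
# lengthened by a factor n, given its current reliability r.
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

# Split-half use: the half-test correlation stands in for r and n = 2,
# since the whole test is twice as long as either half.
whole_test = spearman_brown(0.60, 2)   # 1.2 / 1.6 = 0.75
doubled_40 = spearman_brown(0.70, 2)   # about 0.82 (compare review item 6)
```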
Difference scores tend to be less reliable than scores on individual tests. As the
correlation between the scores on the two tests increases, the reliability of the difference
scores decreases.
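One classical formula for the reliability of difference scores is consistent with the claim above; the formula is standard but not quoted in the text, and the coefficients below are illustrative.

```python
# Classical reliability-of-difference-scores formula:
# r_dd = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy),
# where r_xx and r_yy are the two tests' reliabilities and r_xy their correlation.
def diff_score_reliability(r_xx, r_yy, r_xy):
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Holding both reliabilities at .90, a higher pretest-posttest correlation
# yields a lower reliability for the difference scores.
low_corr = diff_score_reliability(0.90, 0.90, 0.50)    # about 0.80
high_corr = diff_score_reliability(0.90, 0.90, 0.80)   # about 0.50
```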
An observed test score may be considered as consisting of two parts: the true
component and the error component.
The standard error of measurement is the standard deviation of the distribution of error
scores. As reliability increases, the standard error of measurement decreases.
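A standard formula relating the two quantities is SEM = SD times the square root of one minus the reliability; it is consistent with review item 5 below. A sketch:

```python
import math

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# With an observed SD of 20 and a reliability of .84 (review item 5):
sem = standard_error_of_measurement(20, 0.84)   # about 8.00
```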
We can use the concepts of reliability and standard error of measurement in making
inferences about how an individual’s score would fluctuate on repeated use of the same
test. The distribution of scores would have a mean approaching the individual’s true
score and a standard deviation equal to the standard error of measurement.
Item similarity enhances reliability, and item difficulty affects reliability such that items of
moderate difficulty, around 50 percent correct responses per item, enhance reliability.
Review Items
1. The reliability coefficient can take on values:
a. From 0 to +1.00, inclusive
b. From – 1.00 to + 1.00, inclusive
c. Of any positive number
d. From – 1.00 to 0, inclusive
2. If a group of students was measured in September using a mathematics
achievement test and then tested again in October using the same test, the
correlation coefficient between the scores of the two test administrations would be
a measure of:
a. Stability reliability
b. Equivalent reliability
c. Internal consistency reliability
d. Both stability and equivalence reliability
5. On a given test, the observed standard deviation of the scores is 20, and the
reliability of the test is .84. The standard error of measurement is:
a. 8.00
b. 18.33
c. 3.20
d. 16.80
6. A test of 40 items has a reliability of .70. If the test is increased to 80 items, the
reliability will be:
a. 0.99
b. 0.54
c. 0.82
d. 0.90
10. In applying the split-half procedure for estimating reliability, the reliability coefficient
for one-half the test is computed. To estimate the reliability of the entire test, we use
the:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Spearman-Brown formula
d. Cronbach alpha procedure
12. A test of 100 items is divided into five subtests of 20 items each. If we are interested
in internal consistency reliability, the most appropriate procedure for estimating
reliability is:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Cronbach alpha
d. Parallel forms
13. A mathematics test is given to a class of gifted students and also to a regular
ungrouped class. The reliability of the test would likely:
a. Greater for the gifted class
b. Greater for the ungrouped class
c. About the same for both classes
d. Unable to infer anything until the reliability coefficient is computed
16. Theoretically, with respect to variance, reliability can be considered the ratio of:
a. Observed variance to true variance
b. Error variance to observed variance
c. True variance to error variance
d. True variance to observed variance
17. Conceptually, the true component and the error component of a test score are such
that:
a. The greater the true component, the greater the error component
b. The greater the true component, the smaller the error component
c. The components are equal
d. The components are independent
18. In conceptualizing the distributions of observed, true, and error scores, the following
is true for the means:
a. The observed mean equals the true mean
b. The error mean equals zero
c. The observed mean equals the true mean plus the error mean
d. All of the above
19. Conceptually, the variances of the distributions of the observed, true, and error
score are such that:
a. The variance of the error scores is zero
b. The error variance plus the true variance equals the observed variance
c. The observed variance is less than the true variance
d. The observed variance and the true variance are equal
20. A difference score is generated by subtracting a pretest score from a posttest score.
In order to obtain a high reliability for the difference score, we require:
a. Low correlation between pretest and posttest scores
b. High reliability for both pretest and posttest scores
c. Both a and b
d. Neither a nor b
VALIDITY OF NORM-REFERENCED TESTS
Validity is the extent to which a test measures what it is intended to measure.
Content validity is concerned with the extent to which the test is representative of a
defined body of content consisting of topics and processes.
Content validity is based on a logical analysis. It does not generate a validity coefficient,
as is obtained with some other types of validity.
Standardized achievement tests tend to have broad content coverage so they will have
wide application. However, when used in a specific situation, the content validity of a
prospective test should always be considered.
Criterion validity is based on the correlation between scores on the test and scores on a
criterion. The correlation coefficient is the criterion validity coefficient.
Concurrent validity is involved if the scores on the criterion are obtained at the same
time as the test scores. Predictive validity is involved if the scores on the criterion are
obtained after an intervening period from those of the test.
Concurrent validity applies if it is desirable to substitute a shorter test for a longer one.
In that case, the score on the longer test is the criterion, and validity is that of the
shorter test.
The construct validity of a measure or test is the extent to which scores can be
interpreted in terms of specified traits or constructs.
For tests validated through correlation with a criterion measure, validity can be expressed
as the proportion of the observed test variance that is common variance with the
criterion. The validity coefficient is the square root of this proportion.
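A tiny numeric sketch of this relationship (the coefficient value is chosen purely for illustration):

```python
import math

# A validity coefficient of .60 means the proportion of observed test
# variance shared with the criterion is its square, .36, and the
# coefficient is recovered as the square root of that proportion.
validity_coefficient = 0.60
common_variance_proportion = validity_coefficient ** 2   # 0.36
assert math.isclose(math.sqrt(common_variance_proportion), validity_coefficient)
```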
A well-constructed test with items of proper difficulty level will enhance validity. Validity
tends to increase with test length. Low item intercorrelation may tend to enhance
criterion validity if we have a complex criterion.
Increased heterogeneity of the group measured tends to enhance validity. Subtle,
individual factors may also affect validity. Tests should be properly administered, since
any procedures that impede performance also lower validity.
Review Items
1. Which of the following types of validity does not yield a validity coefficient?
a. Predictive
b. Concurrent
c. Content
d. Criterion
2. When considering the terms reliability and validity, as applied to a test, we can say:
a. A valid test ensures some degree of reliability
b. A reliable test ensures some degree of validity
c. Both a and b
d. Neither a nor b
4. If scores on a high-school academic aptitude test are highly and positively related to
freshman GPA in college, the test has:
a. Construct validity
b. Concurrent validity
c. Predictive validity
d. Content validity
5. Two forms of an essay test in American history were developed and administered to
a group of 35 students. The correlation between the scores on the two forms is a
measure of:
a. Predictive validity of the test
b. Content validity of the test
c. Test reliability
d. Reader agreement
6. The index of criterion validity for a test is computed by finding the correlation
between scores on:
a. The test and those on an external variable
b. Two forms of the test
c. Two halves of the test, such as scores on even-numbered and odd-numbered
items.
d. Two administrations of the test
10. The testing division of a school system is attempting to analyze the traits that are
inherent in the six subscores of an academic achievement test. The validity of
concern here is:
a. Concurrent
b. Content
c. Construct
d. Predictive
13. A factor analysis is conducted on the scores from six different IQ tests. One of the
factors has a large loading on a single IQ test and very small loadings on the
other five tests. This is a:
a. General factor
b. Specific factor
c. Group factor
d. None of the above
14. If a validity coefficient is computed for a test, and the test has been used with a very
homogeneous group of students, we expect that the validity coefficient will be:
a. Moderate, around .55
b. High
c. Low
d. Unable to make an inference
16. A test is found to have high reliability but low validity. In order for this to occur, the
test has:
a. Little true variance
b. Large error variance
c. Large specific variance
d. Little observed variance
17. When using criterion measures for establishing validity, a validity coefficient is
computed. Theoretically, in terms of variance, the validity coefficient is the square
root of the ratio of:
a. Variance common with the criterion to observed variance
b. Observed variance to variance common with the criterion
c. True variance in the criterion to observed variance
d. True variance in the criterion to true variance in the test being validated
19. Predictive validity of a test is increased as the groups tested become more
homogeneous.
T F
20. Construct validity refers to the adequacy of item construction for a test.
T F
CRITERION-REFERENCED TESTS
When the test items are not representative of a well-specified domain, we cannot
generalize our results beyond the specific items on the test.
Item forms contain enough detail about how the items should be constructed so that
they represent a well-specified domain.
Teachers can construct criterion-referenced tests through the use of objectives and item
specifications.
Review Items
1. Item forms are seldom used by classroom teachers because the forms:
a. Lack validity
b. Are complex and unwieldy
c. Require extensive pilot testing
d. Are appropriate only for standardized tests
2. Item forms refer to:
a. Item-writing rules
b. Response types (e.g., true-false, multiple-choice)
c. Parallel forms of items for reliability
d. Patterns of responses to sets of items
9. The method of setting standards that is most likely to be used in classroom settings
is the:
a. Professional judgment method
b. Nedelsky method
c. Angoff method
d. Contrasting groups method
10. A panel of qualified experts is not used in which of the following methods of setting
standards?
a. Professional judgment
b. Nedelsky
c. Angoff
d. Contrasting groups method
12. Critics of criterion-referenced tests are correct when they characterize standard-
setting procedures as:
a. Vague
b. Subjective
c. Inconsistent
d. Sophisticated
13. Tests that are linked to brief, specific instructional objectives are most likely good
examples of criterion-referenced tests.
T F
14. The item formats (e.g., multiple-choice, essay, etc.) should be different for norm-
referenced tests than for criterion-referenced tests.
T F
15. The “panel of experts” that is used in some standard-setting methods in school
settings consists of:
a. Academically talented students
b. Classroom teachers
c. Parent volunteers
d. Students from higher grades
ITEM STATISTICS FOR CRITERION-
REFERENCED TESTS
A test score is determined by the student’s performance on each of the items on
the test. In order to understand the test score, it is essential that we understand how
each item contributes to that score. The quality of the test depends on the quality of the
items that compose it. The procedures described in this chapter are ways to examine
the quality of the test items.
Items should be subjected to a content review before the test is given. Experts and
colleagues can help us by reviewing the test items for their match with the domain
specifications or the objectives, for any potentially biased wording, and for any
observable flaws in the item’s construction.
After the tests have been administered and scored, there should be a review of the
kinds of errors that were made so that remediation can focus on these errors. Statistical
analysis should be done so that we have evidence about the difficulty levels of the items
and about the degree to which the items are discriminating between masters and non-
masters or between students before and after instruction.
The difficulty levels of test items often turn out to be quite different from what the
teacher expected. Difficulty levels are a clear measure of how the students performed on
a specific task—the test item. As such, they provide very useful information to the
teacher.
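The difficulty calculation described above can be sketched in a few lines; the function name and the sample data are illustrative, not from the text. For dichotomously scored items (1 = correct, 0 = incorrect), the difficulty index is simply the proportion of examinees who answered correctly:

```python
def difficulty_index(responses):
    """Difficulty index p: proportion of examinees answering the item
    correctly. Counterintuitively, a higher p means an easier item."""
    return sum(responses) / len(responses)

# Hypothetical data: 30 students tested, 20 answer the item correctly.
item4 = [1] * 20 + [0] * 10
print(round(difficulty_index(item4), 2))  # 0.67
```

This matches review item 7 below: 20 correct out of 30 gives a difficulty index of about .67.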
The discrimination index provides information that is directly related to the purpose of
the test. A discrimination index can be seen as an analogy to the sport of rowing. If all of
the items on a test are likened to the crew members, we see that things work best when
they are all pulling together. This is the case when all of the discrimination indexes are
positive. If one of the crew lifts his or her oars out of the water and does nothing, it is
like an item with a zero discrimination index. A negative discrimination index would be
the situation of a crew member (item) rowing in the opposite direction from the rest of
the crew. Clearly, this latter case requires corrective action.
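One common form of the discrimination index for criterion-referenced tests, consistent with the masters/non-masters comparison mentioned earlier, is the difference in proportion correct between the two groups. This sketch assumes that form and uses hypothetical response data:

```python
def discrimination_index(masters, non_masters):
    """Discrimination: proportion correct among masters minus proportion
    correct among non-masters (1 = correct, 0 = incorrect).
    Positive = the item 'pulls with the crew'; zero = contributes nothing;
    negative = the item works against the rest of the test."""
    p_masters = sum(masters) / len(masters)
    p_non_masters = sum(non_masters) / len(non_masters)
    return p_masters - p_non_masters

# Hypothetical responses to one item.
masters = [1, 1, 1, 0, 1]      # 4 of 5 masters correct (.80)
non_masters = [0, 1, 0, 0, 0]  # 1 of 5 non-masters correct (.20)
print(round(discrimination_index(masters, non_masters), 2))  # 0.6
```

The same function could compare the same students before and after instruction, the other comparison the summary mentions.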
We cannot merely assume that we create high-quality test items. We need to subject
those items to item analysis in order to convince ourselves and others of their quality;
the time is well spent. The information that is provided will help us to understand
the quality of our tests so that we can base decisions on those test scores with
confidence.
Key Terms and Concepts
Review Items
1. An item with a p value near 1.0 is quite:
a. Easy
b. Difficult
c. Discriminating
d. Reliable
3. Panels with diverse backgrounds can be used to examine test items for:
a. Item bias
b. Difficulty
c. Discrimination
d. Continuity
7. If 30 students are tested and 20 answer item 4 correctly, the difficulty index for
item 4 would be:
a. .10
b. -.10
c. .33
d. .67
8. The item statistic that would indicate the most serious concern would be:
a. Difficulty equal to .85
b. Difficulty equal to .05
c. Discrimination equal to -.50
d. Discrimination equal to .00
9. Items that match a well-specified domain should have difficulty levels that:
a. Are exactly equal
b. Are very similar
c. Range from 0 to 1
d. Match the domain specification
10. The higher the value of the difficulty index, the:
a. Easier the item
b. More discriminating the item
c. Lower the percentage correct on the item
d. More biased the item
-12-
RELIABILITY OF CRITERION-REFERENCED
TESTS
A test is reliable if it provides consistent information about examinees. This can mean
that a criterion-referenced test provides consistent estimates of performance on a
domain or that the test provides consistent placement of an examinee in a mastery or
non-mastery category. Different kinds of reliability evidence are needed for each of
these uses of criterion-referenced tests.
Whether a test is consistent relative to mastery decisions is shown by giving the test on
two occasions to the same group of examinees and finding the percentage of
examinees whose mastery/ non-mastery classifications were both the same on the two
test occasions. This procedure could also be used when a parallel form of the test is
given on the second testing. A reliable test would have a high percentage of examinees
with the same mastery/ non-mastery classification on the two tests.
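The agreement percentage described above is straightforward to compute. In this sketch the function name, cut-off, and scores are hypothetical:

```python
def classification_consistency(scores1, scores2, cutoff):
    """Percentage of examinees who receive the same mastery/non-mastery
    classification on two administrations of the test (or on a parallel
    form given on the second occasion)."""
    same = sum(
        (a >= cutoff) == (b >= cutoff)   # classified the same way both times?
        for a, b in zip(scores1, scores2)
    )
    return 100 * same / len(scores1)

# Hypothetical scores for 5 examinees on two occasions, cut-off = 70.
first = [80, 65, 90, 50, 72]
second = [78, 71, 88, 55, 69]
print(classification_consistency(first, second, 70))  # 60.0
```

A reliable test would yield a high percentage; the 60% here would signal a consistency problem.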
When performance on a domain is to be estimated from the test scores, the standard
error of measurement can be used to form an interval estimate. An interval estimate
suggests the degree of imprecision that is in our test scores. The standard error of
measurement gives us an idea about how much we can expect test scores to fluctuate
across repeated testing.
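The interval estimate can be illustrated with the standard formula for the standard error of measurement, SEM = SD × sqrt(1 − reliability); the numbers below are hypothetical:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_band(score, sd, reliability):
    """Rough 68% interval estimate: observed score plus/minus one SEM."""
    error = sem(sd, reliability)
    return score - error, score + error

# Hypothetical test: sd = 10, reliability = .91, so SEM = 3 score points.
low, high = score_band(75, 10, 0.91)
print(round(low, 1), round(high, 1))  # 72.0 78.0
```

In words: for a student scoring 75 on this hypothetical test, we would expect scores on repeated testing to fall between about 72 and 78 roughly two times out of three.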
The reliability of the test can be increased by careful attention to the test items, the test
setting, and the examinees. A reliable test would have items that are
homogeneous. The more similar the items are, the more consistent will be students’
approach to those items. The items should be free of flaws or vagueness of wording so
that inconsistencies are reduced. And, because there is a direct relationship between
the length of the test and the reliability of the test, there should be a sufficient number of
items.
Inconsistencies in student performance can be reduced by making sure that the testing
conditions are appropriate. There should be adequate light and quiet so that the student
can concentrate on the task. Interruptions or distractions should be eliminated, and the
test items and directions about how to answer them should be clear.
Reliable scores depend on the students being motivated to apply themselves to the
task. This is promoted when the teacher encourages the students to do well and
explains how the test scores will be used. The teacher should be alert for individual
student problems such as fatigue or anxiety that might be affecting the reliability of the
test scores.
It is important that we pay attention to the reliability of our tests. With high-quality test
items, a well-controlled test setting, and highly motivated students, very good levels of
reliability can be obtained. However, this is not something that can be left to chance; it
requires conscientious effort.
Review Items
1. Which of these terms is most similar to test reliability?
a. Validity
b. Proficiency
c. Trustworthiness
d. Consistency
2. The same methods that are used for estimating the reliability of norm-referenced tests
are used for criterion-referenced tests.
T F
3. The concept of test reliability was developed soon after criterion-referenced testing
was introduced.
T F
4. If a criterion-referenced test is reliable, then the scores from that test are:
a. Useful
b. Standardized
c. Consistent
d. Valid
5. Which one of the following values computed for the reliability of a test would indicate
that the test is totally unreliable?
a. .10
b. .00
c. .50
d. 1.00
6. Exactly 100 students took a criterion-referenced test twice. The test had a mastery
cut-off score; 70 students were above the cut-off score on both tests and 15
students were below the cut-off score on both tests. The reliability of the test for
mastery decisions would be:
a. .40
b. .55
c. .70
d. .85
8. Other things being equal, the longer a test is, the____________will be its reliability.
a. Higher
b. Lower
c. Less ambiguous
d. More valid
9. The reliability coefficients that were developed for norm-referenced tests can also be
used effectively with criterion-referenced tests.
T F
10. The same criterion-referenced test was given to 30 children on consecutive days.
Of the children, 10 who surpassed the mastery cut-off score on the first day failed to
do so on the second day. The test could be said to be:
a. Unfair
b. Biased
c. Unreliable
d. Discriminating
12. Test reliability is primarily determined by the test itself. The test setting and the
examinee have a minimal impact on test reliability.
T F
14. When estimating a domain score, the reliability would increase if:
a. The items were more difficult
b. The items were somewhat dissimilar
c. The test had a cut-off score for mastery decisions
d. The test was longer
-13-
VALIDITY OF CRITERION-REFERENCED TESTS
A test that adequately serves the purpose for which it is used is considered to be a valid
test. Validity is always defined in terms of the purpose for which the test scores will be
used. Validity is a matter of degree. One test may be more valid than another but tests
are not usually totally lacking in validity and they are never perfectly valid.
Because criterion-referenced tests are used for several different purposes, including
estimating performance on a domain and determining whether students have achieved
mastery, it is not surprising that different kinds of logical and statistical evidence should
be presented to support the validity claims. The three kinds of test validity that were
introduced are content validity, criterion validity, and construct validity.
Content validity is a determination of the extent that the test items match the domain
specifications or objectives. Validity is established by having qualified persons, a panel
of experts, review the test items for appropriateness and congruence with the domain.
Criterion validity is concerned with whether the test would be an adequate predictor of
performance on some other variable. Validity evidence is established by finding the
correlation coefficient that links the test with the criterion that is to be predicted. The
choice between two competing tests would be based on which test has the higher
correlation with the criterion. When we are concerned about mastery decisions on two
measures, the degree of validity is shown by the percentage of persons for which the
mastery/ non-mastery decision is consistent.
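The correlation evidence described above is the ordinary Pearson correlation between test scores and the criterion. As an illustrative sketch (the data and function name are hypothetical):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between test scores and a criterion measure.
    Values near 1.0 indicate the test is a good predictor of the criterion."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical: four students' test scores vs. later criterion performance.
test_scores = [60, 70, 80, 90]
criterion = [55, 65, 72, 88]
print(round(pearson_r(test_scores, criterion), 2))  # 0.99
```

Given two competing tests, the one with the higher correlation with the criterion would be preferred, as the summary notes; for mastery decisions, the classification-consistency percentage from the reliability chapter applies in the same way across the two measures.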
Construct validity is shown by making predictions about the test scores and then
conducting analyses to see whether the predictions are confirmed. Some of the
reasonable predictions are: (1) the test scores should be positively correlated with other
measures of the same thing, (2) groups that are known to differ on the domain should
have test scores that are significantly different, and (3) we should not find different
patterns of responses across distracters for persons of different races, grades, or other
characteristics.
We cannot merely assume that our tests are valid. We need to conduct careful analyses to show
that our tests have sufficient content, criterion, or construct validity so that we can justify
the use of the tests.
Review Items
T F
2. Which of the following is not a type of validity that is described in the technical standards for
test publishers?
a. Criterion-referenced validity
b. Content validity
c. Construct validity
d. Criterion validity
3. Whether the items on a test match the domain of the criterion-referenced test is
primarily a concern about:
b. Item validity
c. Content validity
d. Criterion validity
4. Essentially the same processes are used to establish the validity of criterion-
referenced tests as are used with norm-referenced tests.
T F
a. Reliability
b. Content validity
c. Discrimination
d. Criterion validity
a. Content validity
b. Test sales
c. Construct validity
d. User validity
7. If students who surpass the mastery cut-off score for the addition of three-digit
numbers also tend to be those students who achieve mastery
a. Content validity
b. Criterion validity
c. Convergent validity
d. Mathematical validity
b. Mathematical validity
c. Content validity
d. Criterion validity
a. Content validity
b. Convergent validity
c. Construct validity
d. Criterion validity
10. If items on a criterion-referenced test do not match a well-defined domain, the test
lacks adequate:
a. Construct validity
b. Content validity
c. Criterion-referenced validity
d. Criterion validity
11. If two writers, working from the same test specifications, created test items that were
quite different from each other, the test would have inadequate:
a. Criterion validity
b. Item validity
c. Specification validity
d. Content validity
13. When a test does not achieve the purpose for which it was designed, the test lacks:
a. Validity
b. Reliability
c. Purposefulness
d. Discrimination
a. Construct validity
b. Criterion validity
c. Content validity
d. Criterion-referenced validity
-14-
FACTORS THAT AFFECT TEST SCORE
The probability of guessing the correct answer on a selected response question is 1/K,
where K is the number of options.
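Applied over a whole test, this rule gives the expected score from blind guessing: the number of items times 1/K. A quick sketch with hypothetical test lengths:

```python
def expected_guess_score(n_items, n_options):
    """Expected number correct from blind guessing:
    n_items * (1 / K), where K is the number of options per item."""
    return n_items / n_options

# 50-item test, five options per item: expected chance score is 10.
print(expected_guess_score(50, 5))  # 10.0
# 50-item true-false test (K = 2): expected chance score is 25.
print(expected_guess_score(50, 2))  # 25.0
```

This is why, for tests of equal length, a pure guesser scores highest on true-false items.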
Spelling and punctuation should not influence the score on an essay item unless the
student is informed beforehand that they will be considered in the scoring.
Characteristics of the student’s test-taking behaviors that can affect test scores include:
Guessing
Penmanship
Positional preference
Spelling
Changing answers
Bluffing
Anxiety
Reading difficulty
There are numerous factors, external to the knowledge and skills of the students, that can
affect test scores. Characteristics of the students and the testing situation can be
influential. It is important to control these factors and minimize their impact because as
they affect the test score, they detract from that score being a measure of the student’s
true score.
Student characteristics are to some extent controlled by the student (i.e., how one
handles the test situation and responds to the items). We discussed factors such as
test-taking skills programs and guessing, as well as mechanical factors such as
penmanship and spelling. We also addressed what is perhaps the greatest student
concern: test anxiety.
Test administration factors that are part of the testing situation were also described.
Often, these factors can be controlled by the teacher. Finally, we concluded that
students should be prepared to take tests as a basic part of their education, because many
have poor test-taking skills. Providing instruction in those skills is not an unethical
attempt to beat the system. Rather, it is an effort to make the test scores as accurate a
measure of performance as possible.
Key Terms and Concepts
Review Items
1. Programs for teaching test-taking skills tend to be:
a. Equally effective throughout grades 1-8
b. More effective with lower grades than upper elementary grades
c. More effective with upper grades than lower elementary grades
d. Of no effect throughout grades 1-8
2. A student who guesses on every test item will have the highest score on which kind
of test? (Assume the tests have equal length)
a. True-false
b. Multiple-choice (four-option)
c. Multiple-choice (five-option)
d. Fill-in-the-blank
4. What is the expected score of a student who guesses on all items of a 50 item
multiple-choice test with five options per item?
a. 0
b. 5
c. 10
d. 15
6. A limit is set on the length of response (in words) to an item. This is an attempt to
limit the effect of:
a. Guessing
b. Bluffing
c. Positional preference
d. Changing answer
7. Which of the following is least related to the others? The effect of:
a. Test anxiety
b. Bluffing
c. Penmanship and spelling
d. Positional preference
9. Students should be shown how to take tests so that the tests provide:
a. Enriched diagnostic information for the future informational planning
b. Information about the test setting, as well as the test content
c. Information about wrong answers, as well as right answer
d. A more accurate picture of what the student is able to do
11. Generally, more test answers are changed from wrong to right than right to wrong.
T F
12. Applying the correction for guessing raises the score on a test.
T F
13. Good penmanship and spelling tend to be positively correlated with the grades
assigned to essay responses.
T F
14. The physical arrangement of the testing situation is as important for enhancing
student performance as is establishing control and rapport.
T F
15. Separate answer sheets can be used effectively for students beginning with those in
second grade.
T F
-15-
THE USE OF STANDARDIZED ACHIEVEMENT
TESTS
Standardized achievement tests are widely used in our schools. There are many tests
on the market, available in a variety of forms, including norm-referenced and criterion-
referenced tests.
Test results can be reported for individual students, classrooms, or even school
buildings. In addition, local and national norms are available for many tests.
Teachers play an important role in standardized achievement testing. They need to take
this role seriously and make sure that the physical and psychological settings promote a
positive testing environment.
Despite their popular use, standardized achievement tests do have certain limitations.
Some of these deal with the time required for the test administration and the processing
of test scores. Other limitations concern the usefulness and accuracy of the reported
scores. Some people worry that the average performance has become a standard of
performance and that achievement tests may have too much influence over school
curricula.
High-quality standardized achievement tests are available; they do the job that they
were designed to do. However, when tests are used for other purposes, their
effectiveness will be limited. Therefore, those who select standardized achievement
tests must do a careful and complete job of comparing alternatives.
Review Items
1. What is standardized on a standardized achievement test?
a. The anticipated level of performance
b. The conditions for test administration
c. The test validity
d. The purpose for which the test is given
5. Which of the following factors should be the most important consideration when
selecting a standardized achievement test?
a. Reliability
b. Cost
c. Publisher’s reputation
d. Relevance
7. Most teachers find that the results of standardized achievement tests are helpful to
them when planning instruction for individual students.
T F
9. Achievement-at-grade-level is:
a. A meaningless term statistically
b. An expectation that we should have for all students
c. Average performance for students in that grade
d. An arbitrary assessment based on standardized achievement test scores
10. What is a form of testing in which the items that are presented to the student
depend on the student’s answers to previous items?
a. Non-standardized testing
b. Response-dependent testing trials
c. Step-by-step testing
d. Computerized adaptive testing
11. If the items on a standardized achievement test did not match what was taught in a
particular school, the test would lack:
a. Technical adequacy
b. Relevance
c. Utility
d. Reliability
12. The verb that is frequently used in Standards for Educational and Psychological
Testing and that shows the orientation of the Standards is:
a. Should
b. Must
c. Might
d. Shall
13. Some major standardized achievement tests interpret the same test performance in
both norm-referenced and criterion-referenced formats.
T F
14. Standardized achievement tests are among our best examples of high-quality tests in
education.
T F
15. Which of the following will probably occur within the next decade?
a. The use of standardized achievement tests will decline dramatically
b. Test publishers will provide either norm-referenced or criterion-referenced tests,
but not both
c. Substantial improvements will be made in the reliability and validity of
standardized achievement tests
d. Much of the testing will be done using computer terminals
-16-
TEST BIAS
Test bias is systematic error, usually evidenced by different performance on the test by
two or more groups or individuals.
The fact that one or more specific groups score poorly on a specific test does not
necessarily prove test bias. Bias occurs when test scores are consistently inaccurate
(either too high or too low) for an individual or group.
When checking for item bias, the focus is on response to individual items rather than the
total test score.
There is a distinction between a test being biased and using test results in a biased
manner. This difference is important in getting at the source of the problem.
The test user is responsible for selecting an appropriate and adequate test for the
specific purpose of testing. He or she must also ensure that the test is administered in
a consistent and fair manner.
Teachers and school officials must ensure that confidentiality is maintained in the use,
storage, and disposal of test scores and interpretations.
Review Items
1. Which situation is an example of test bias?
a. Third-grade boys consistently score lower than third-grade girls on a reading test
b. Women consistently score higher than men on a test used to predict success in a
specific profession
c. All of the above
d. None of the above
4. Members of groups, such as minority groups, are most likely to consider tests
biased if the results are used for:
a. Admission to remedial programs
b. Selection decisions
c. Assigning grades
d. Instructional evaluation
6. Litigation over the use of test results has most commonly focused on the use of:
a. Classroom tests for grading purposes
b. Aptitude tests for college admission
c. Achievement tests for school advancement
d. Psychological tests for screening and placement
7. The provision in truth-in-testing laws most troublesome for test publishers is the:
a. Need to do differential item functioning analysis
b. Requirement for sensitivity review
c. Availability of test questions and their correct answers
d. Disclosure of test validity studies
8. Access to test information (that is, the persons having a right to know test scores of
others) is addressed by:
a. Federal legislation through P. L. 94-142
b. State legislation
c. Federal legislation through the Dole and Baker Amendment
d. Federal legislation through the Buckley Amendment
11. When selected groups perform high or low on a test, we have conclusive evidence
of test bias.
T F
12. Test bias (or bias in the scores of a test) occurs if there is random error in the
measurement.
T F
13. Achievement tests in academic and skill areas have less likelihood of being biased than
intelligence tests or general ability tests.
T F
14. If a test continually presents women performing lower-status, service-type tasks, the
test is performance biased.
T F
15. The Larry P. v. Wilson Riles case and the PASE v. Hannon case were concerned with
violations of truth-in-testing laws.
T F
-17-
HIGH-STAKES TESTS
High-stakes tests are those that are perceived to have important consequences for the
examinees. Scoring well or poorly on these tests may determine whether one is
promoted, graduated, accepted into a particular college, or allowed into a certain
profession. When the test results are used for such important decisions, it is not
surprising that examinees feel intense pressure when preparing for the tests or that the
general public is interested in the tests and how well students do on them. Some people
believe that these controversial tests bring objectivity, fairness, and rigorous standards
into decisions about students. Others feel that the usefulness of these tests has been
grossly overstated and that they have important negative effects that outweigh their
usefulness. Our position is that it is important to understand the strengths and the
weaknesses of these tests so that informed judgments can be made about them.
Minimum competency tests are used in elementary and secondary schools as one basis
for decisions about promotion or graduation. Competency tests were meant to be a
safeguard against graduating students who lacked essential skills (i.e., students who
could not read or compute). Although such tests were targeted toward a small
percentage of low-achieving students, they have become a requirement for all students.
The controversy about competency tests concerns what content and skills
should be included on the test, what the minimal acceptable standard of performance
should be, and what should be done for those students who fail the test. Recent studies
suggest that the impact of these tests has been undermined by ‘safety nets’ such as the
ability to retake the test many times, tinkering with the cut-off score, and the possibility
of overruling the test results.
College admissions tests were also described as high-stakes tests because of how the
ACT and the SAT are used in admissions decisions. These tests have a long history
and their reputation has grown over the years. There is great pressure to do well on
these exams if one is applying to a selective college or university. Proponents of the
tests believe that the tests provide a common yardstick for evaluating students. However,
colleges may use the test scores in any way they choose, including not at all. Thus, the
importance of the test score in the college admission process is often overestimated.
Critics of admissions tests indicate that the scores add little to the prediction of
freshman grades beyond information that is available in the high school record. Crouse
and Trusheim (1988) stated that essentially the same freshman class will be admitted
whether the test scores are used or not.
Review Items
1. All minimum competency tests have cut-off scores.
T F
3. The pressure for minimum competency testing programs comes mainly from:
a. Students
b. Teachers
c. Legislators
d. Testing experts
4. The most difficult decision concerning minimum competency testing tends to be:
a. What standard of performance to expect
b. Which students to test
c. What content to include on the test
d. Who should be involved in planning the test
5. The ability to retake a minimum competency test, student exemptions from the test,
and lowering the performance standard by the standard error of measurement have
been called:
a. Practical policy adjustments
b. Failure failsafes
c. Test essements
d. Safety nets
6. The ACT and the SAT are evaluated primarily in terms of their:
a. Standard error of measurement
b. Content validity
c. Reliability
d. Criterion validity
8. Books about how to do well on the SAT or the ACT would most help:
a. Older students
b. Prospective English majors
c. Students who are intimidated by tests
d. Students who are unfamiliar with tests
11. The factor that distinguishes high-stakes tests from other tests is:
a. The types of test items
b. A mastery cut-off score
c. The consequences of doing well or poorly
d. The cost of taking the tests
12. Which term best describes high-stakes test such as the SAT or the NTE?
a. Invalid
b. Controversial
c. Subjective
d. Unfair
14. The implementation of minimum competency testing programs at the state level has
clearly increased the public’s confidence in the public schools.
T F