
IRIS Topics Include:

  Accommodations
  Behavior
  Collaboration
  Disability
  Diversity
  Learning Strategies
  RTI


WHAT IS IRIS?
The IRIS Center for Training Enhancements is based at Vanderbilt University's Peabody College and Claremont Graduate
University. The Center, supported through a federal grant from the Office of Special Education Programs (OSEP), creates
enhancement materials and resources for college faculty preparing future education professionals and for professional
development providers who conduct inservice trainings for current school personnel.


What Resources Does IRIS Offer?
IRIS training enhancements are designed to better prepare school personnel to provide an appropriate education to students with
disabilities. To achieve this goal, the Center has created free course enhancement materials for college faculty and professional
development providers. These materials can be used either as homework or as in-class or training activities.

STAR LEGACY MODULES
Offer challenge-based interactive lessons
Apply the How People Learn (HPL) framework (developed by
John Bransford and colleagues)
Translate research into effective teaching practices
Produce significant learner outcomes

CASE STUDIES
Include three levels of problems to solve
Illustrate evidence-based instructional strategies
Are accompanied by answer keys (upon request)

ACTIVITIES
Activities are created to accompany lectures and professional development training, to be assigned as independent
homework, or to promote discussion. They cover a wide range of topics related to special education and disabilities.

INFORMATION BRIEFS
Information briefs are gathered from a number of sources and are included on the IRIS Web site to offer quick facts and
details on a wide range of disability-related subjects.

WEB RESOURCE DIRECTORY
The Web Resource Directory is a search engine that helps users locate information about special education and disability-
related topics available through other Web sites.

IRIS FILM TOOL
The Film Tool is a comprehensive database of motion pictures featuring or having to do with people with
disabilities (some of them inaccurate or negative) as a means of stimulating discussions of popular depictions of disabilities.

ONLINE DICTIONARY
The Online Dictionary contains hundreds of definitions of disability- and special education-related terms, plus cross-links
between definitions for easier searching.

PODCASTS
IRIS downloadable podcasts feature audio interviews with some of the most knowledgeable experts in the field.


All IRIS materials are available at no cost through the IRIS Web site
http://iris.peabody.vanderbilt.edu
Enhance your program with these FREE online resources from IRIS!

The IRIS Center for Training Enhancements
Peabody College at Vanderbilt University
Instructor's Manual and Test Bank



for



Thorndike and Thorndike-Christ



Measurement and Evaluation in
Psychology and Education


Eighth Edition



prepared by



Tracy Thorndike-Christ
Western Washington University









Boston New York San Francisco
Mexico City Montreal Toronto London Madrid Munich Paris
Hong Kong Singapore Tokyo Cape Town Sydney




















Copyright © 2010 Pearson Education, Inc.


All rights reserved. The contents, or parts thereof, may be reproduced for use with Measurement and
Evaluation in Psychology and Education, Eighth Edition, by Robert M. Thorndike and
Tracy Thorndike-Christ, provided such reproductions bear copyright notice, but may not
be reproduced in any form for any other purpose without written permission from the
copyright owner.

To obtain permission(s) to use the material from this work, please submit a written
request to Permissions Department, 501 Boylston Street, Suite 900, Boston, MA 02116;
fax your request to 617-671-2290; or email permissionsus@pearson.com





www.pearsonhighered.com

ISBN-10: 0-13-134765-9
ISBN-13: 978-0-13-134765-6




CONTENTS

Chapter 1 Fundamental Issues in Measurement...1

Chapter 2 Measurement and Numbers....10

Chapter 3 Giving Meaning to Scores...19

Chapter 4 Qualities Desired in Any Measurement
Procedure: Reliability......30

Chapter 5 Qualities Desired in Any Measurement
Procedure: Validity......41

Chapter 6 Practical Issues Related to Testing.....52

Chapter 7 Assessment and Educational Decision Making.....58

Chapter 8 Assessing Special Populations: Psychometric,
Legal, and Ethical Issues.66

Chapter 9 Principles of Test Development......76

Chapter 10 Performance and Product Evaluation...88

Chapter 11 Attitudes and Rating Scales......94

Chapter 12 Aptitude Tests........105

Chapter 13 Standardized Achievement Tests.........113

Chapter 14 Interests, Personality, and Adjustment.......120


Introduction

This manual has been prepared to aid instructors using the eighth edition of Measurement and Evaluation
in Psychology and Education. Instructors who adopt the book are welcome to reproduce any of the
material contained herein to assist them with their classes. The manual contains six basic kinds of
material for each chapter:

1. A detailed chapter outline.
2. A set of study questions.
3. Answers for the study questions.
4. A list of important names and terms.
5. A matching exercise for important names and terms.
6. Multiple choice questions for each chapter with answers.

The chapter outlines are more detailed than those found at the beginning of each chapter in the
text. They are useful to provide students with an overview of what will be covered and can serve as a
lecture outline. Some instructors find it useful to provide students with the outline spaced in such a way
that students can take notes under each topic heading.

The study questions test students' knowledge and comprehension of topics covered in the chapter
in a produce-response format. Thus, they can be handed out to students as an aid in preparing for tests,
used as discussion questions for class sessions, or included as short-answer questions on tests. The
matching exercise can be used either as a quiz, as a study aid, or as part of a test.

There are about 25 multiple choice items provided for each chapter. The order of the items
generally follows the order of material in the chapters so that an instructor can readily locate items
dealing with a particular topic or group of topics. It has been my experience over many years of using
these and similar items that an exam of about 40 of these items has an internal consistency reliability of
about .65-.70, and an exam of about 75 items has a somewhat higher reliability. Because tests to cover
the subject matter of a measurement course are heterogeneous with respect to content, and students
achieve differential mastery of that content, internal consistency reliabilities above the .6-.8 level may
indicate a test that is measuring a general trait such as reading ability or test-taking skill rather than course
content.
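
The gain to be expected from lengthening alone can be checked against the standard Spearman-Brown prophecy formula; for a 40-item exam with a reliability of about .67 lengthened to 75 items (k = 75/40 = 1.875),

   \[ r_{75} = \frac{k\,r_{40}}{1 + (k - 1)\,r_{40}} = \frac{1.875 \times .67}{1 + .875 \times .67} \approx .79 \]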

Enough items are provided that instructors can select those items they find particularly attractive
or can prepare alternate test forms for use over several terms. In large classes where multiple forms of the
test are necessary as a security measure, it is better to rearrange items in the test rather than to use
different items for the forms. Word processing software also makes it possible to have the items in the
same order but to reorder the responses so that the letter of the correct answer is different on the two
forms. Combining these two security procedures can make a test nearly cheat-proof.

All of the items have been edited for this edition and most have been revised to include five
response alternatives. If the items are used in the form given, this reduces the effect of guessing on
students' scores; however, instructors who prefer to use four-alternative items can select the three
distractors that they find most effective.

Most of these items have been used several times and have been modified to remove ambiguities
of wording and other problems. These items have shown satisfactory discrimination and difficulty levels
for classes ranging from undergraduate psychology and education majors to first-year graduate students.
I continue to try out new items with my classes each year and will add these new items to the pool as they
are created. If any users would care to send me item analysis data on these items, I would appreciate
receiving them and will use the results in preparing future revisions of the item bank. Editorial comments
on other aspects of the Instructor's Manual are also invited and will be incorporated in future editions.
Instructors who have test items they would like to offer for inclusion in the item bank can send me copies,
preferably including item analysis information, at the address below. All those contributing in any of
these ways will be recognized in future editions of the manual.

Please send comments, data, and items to

Tracy Thorndike-Christ, PhD
Department of Special Education
Western Washington University
516 High Street
Bellingham, WA 98225-9040



Chapter 1: Fundamental Issues in Measurement

Chapter Outline

A. Introduction.
B. A little history.
1. Measurement before 1900
a) Chinese practices
b) Sources of the testing movement
c) Mr. Binet's test
2. The early period.
a) Tentative exploration and the development of theory.
b) Publication of Stanford-Binet by Louis Terman in 1916.
c) Development of group testing techniques by Arthur Otis.
d) Introduction of the theory of reliability and the g factor by Charles Spearman.
3. The boom period.
a) Development of the Army Alpha and Army Beta.
b) Development of the Woodworth Personal Data Sheet.
c) Measurement of vocational interests by E.K. Strong.
d) Measurement of attitudes and values by L.L. Thurstone.
e) Misuse of test scores.
4. The first period of criticism.
a) Introduction of original Kuder scales.
b) Introduction of the Minnesota Multiphasic Personality Inventory.
c) Introduction of the Wechsler Scales.
d) Refinement of factor analysis by Thurstone.
e) Mental Measurements Yearbooks introduced.
5. The battery period.
a) Identification of different types of dimensions and abilities.
b) Use of testing in military programs.
c) Widespread use of nationally developed, commercially prepared tests.
d) Increased use of tests by business, industry, and the civil service system.
e) Increased use of psychological tests in mental institutions.
f) First APA guidelines on test development.
6. Second period of criticism.
a) Increased concern about the use of ability and personality tests in public education and
industry.
b) Concern about discrimination against women and minorities.
7. Period of accountability
a) Standardized tests required for high school graduation.
b) Schools and districts compared on test scores.
C. Types of decisions.
1. Instructional.
2. Curricular.
3. Selection.
4. Placement or classification.
5. Personal.
D. Measurement and decisions.
1. Values and decisions
a) Facts and values
b) Test scores only give facts
c) Values may conflict with facts
E. Steps in the measurement process.
1. Identifying and defining the attribute.
a) Deciding which attributes are relevant.
b) Choosing the right attributes to describe.
2. Determining operations to isolate and display the attribute.
a) Constructs may be combinations of attributes.
b) Operational definitions
3. Quantifying the attribute.
a) Advantages of quantification
b) Scales are rules for assigning numbers
c) Defining the unit
F. Problems related to the measurement process.
G. Current issues in measurement.
1. Testing minorities.
a) Efforts to remove bias.
b) Some reasons for group differences.
2. Invasion of privacy.
a) What information should be obtained?
b) How should information be used?
3. The use of normative comparisons.
4. Other factors that influence scores.
5. Rights and responsibilities of test takers.
a) Standards for educational and psychological testing
b) APA website
c) AERA and high stakes testing
H. Summary.


Study Questions and Answers

1. When and in what country was systematic measurement introduced?
China, in the first century B.C.

2. When did formal measurement first appear in the West?
During the 19th century.

3. What two individuals are credited with developing the correlation coefficient?
Karl Pearson and Sir Francis Galton.

4. List the three types of decisions that led to increased interest in human differences.
a. The demand for objectivity and accountability in the assessment of student performance
in schools.
b. Changes in the definition of abnormality by the medical community.
c. The replacement of the patronage system by civil service.

5. What event heralded the beginning of the modern era in behavioral measurement?
The publication of the 1905 Binet-Simon test.

6. What was Binets most important contribution to measurement?
The successful attempt to measure complex mental processes.

7. What were E.L. Thorndike and his students endeavoring to do at the same time that Binet was
developing the Binet-Simon test?
Developing tests to measure school abilities.

8. List the six periods in the twentieth century into which mental testing can be divided.
a. Early period.
b. Boom period.
c. The first period of criticism.
d. Battery period.
e. Second period of criticism.
f. Period of accountability

9. What was the major contribution of each of the following pioneers in testing?
a. Lewis Terman - the 1916 Stanford-Binet.
b. Arthur Otis - techniques of constructing group mental abilities tests.
c. S.D. Porteus - maze test for measuring non-verbal intelligence.
d. Charles Spearman - the concept of reliability and the theory of a single factor of intelligence.
e. E.K. Strong - the first vocational interest tests.
f. L.L. Thurstone - refined techniques of factor analysis.

10. How did U.S. involvement in World War I spur the advance of testing technology?
Psychologists joined together to develop the first group tests of mental ability, the Army Alpha
and the Army Beta, in order to assist the armed services in the process of inducting soldiers.

11. For what two purposes are selection decisions most likely to be made?
To select employees and to determine eligibility for admission to college.

12. In addition to facts, what consideration is important in making decisions?
Values

13. What are the three steps common to all measurement?
a. Identifying and defining the attribute.
b. Determining the operations to isolate and display the attribute.
c. Quantifying the attribute.

14. What do we call the definition of an attribute that is stated in terms of the specific behaviors that
will be accepted as evidence of its existence?
An operational definition.

15. What are the advantages of attaching numbers to behaviors?
a. To make communication more efficient.
b. To apply the power of mathematics to our analysis.

16. What aspect of assessment tends to cause the most problems in the assessment of minority
students?
Predicting what a student will be able to do in the future, because these types of decisions
always involve some degree of subjectivity and are prone to error.

17. What factor is of primary importance in determining what constitutes an invasion of privacy?
The purpose for which the information is being used.

18. What type of comparison is involved when the performance of one person is compared to the
average performance of a group?
Normative.


Important Names and Terms

The following are important individuals or terms mentioned in Chapter 1:

a. Alfred Binet
b. Army Alpha
c. Arthur Otis
d. E.K. Strong
e. L.L. Thurstone
f. Louis Terman
g. Sir Francis Galton and Karl Pearson
h. Joseph Rice
i. Charles Spearman
j. Operational definition
k. Construct
l. Scale


Match the description below with one of the terms listed above.

____1. The author of the 1916 Stanford Binet Intelligence Scale.

____2. He proposed ways to scale and measure attitudes and values.

____3. He is credited with being the first to develop a group administered standardized test.

____4. He developed the first modern test of intelligence.

____5. He introduced an early measure of vocational interest.

____6. This was the first group test of mental ability.

____7. He developed the first uniform written spelling examination.

____8. He developed the correlation coefficient.

____9. He developed reliability theory.

____ 10. An attribute is defined by how it is measured.

____ 11. An attribute that is not directly observable.

____ 12. The rules for assigning numbers to objects.


Answers to Important Terms

1. f 4. a 7. h 10. j
2. e 5. d 8. g 11. k
3. c 6. b 9. i 12. l


Multiple Choice Items

b. 1. Your text says the primary function of testing is to help people:
a. understand one another.
b. aid decision making.
c. measure basic abilities.
d. learn more efficiently.
e. assess personality.

d. 2. The first systematic program of testing was introduced in:
a. Japan.
b. The United States.
c. Great Britain.
d. China.
e. Germany.

c. 3. The person who developed the first uniform written spelling tests was
a. Francis Galton
b. Alfred Binet
c. Joseph Rice
d. Hermann Ebbinghaus
e. E.L. Thorndike

c. 4. All of the following were considered to be causes of increased interest in individual differences
during the second half of the 19th century except
a. Growing demand for accountability in school-related decision making.
b. Refinements in the way the concept of abnormality was defined in the medical
community.
c. Growing interest in the study of neurological functioning.
d. Replacement of patronage system in government with civil service testing.
e. The large number of immigrants coming to the United States.

a. 5. The first psychologist to use complex tasks to test intelligence was
a. Alfred Binet
b. Francis Galton
c. Lewis Terman
d. David Wechsler
e. L.L. Thurstone

b. 6. Which of the following would NOT be considered a major influence on measurement?
a. Physiological and experimental psychology (Germany).
b. Phenomenological psychology (Italy).
c. Concern about the intellectual functioning of the mentally subnormal (France).
d. Darwinian biology (England).
e. Functionalist psychology (United States).

e. 7. Charles Spearman is given credit for first describing the theory leading to the development of
the concept of:
a. validity.
b. normative frames of reference.
c. generalizability.
d. test standardization.
e. reliability.

a. 8. The first group testing of verbal ability was done with the:
a. Army Alpha
b. Army Beta
c. Woodworth Personal Data Sheet
d. Porteus Maze Test
e. Stanford Revision of the Binet-Simon Scales

a. 9. In the history of testing in the United States, the 1930s are considered:
a. a period of criticism and consolidation.
b. the period when educational and psychological testing became big business.
c. a boom period for advances and innovations.
d. a time of rapid expansion in the variety of behaviors subject to measurement.
e. the period when test batteries became popular.

e. 10. To which type of decision does the following question refer? Should Peter be put in an
advanced or regular algebra section?
a. Personal decision.
b. Instructional decision.
c. Curricular decision.
d. Selection decision.
e. Placement decision.

d. 11. The first step in measuring a trait such as creativity is to:
a. determine the situations where creativity is displayed.
b. study some outstandingly creative persons.
c. determine in what units creativity is to be measured.
d. agree upon what we mean by creativity.
e. carefully examine some instances where creative behavior occurred.

a. 12. The most important element that measurement cannot supply in the decision process is:
a. a weighting of values.
b. information about the personalities of the individuals involved.
c. information about the precision of test results.
d. a scale of units for reporting test scores.
e. a standardized setting for making observations.

d. 13. The question we should ask before trying to define an attribute of a person is whether:
a. this attribute is relevant to the decisions we must make.
b. we know of any behaviors that exemplify the attribute.
c. procedures for appraising the attribute are sufficiently reliable.
d. people can agree about what the attribute means.
e. people will agree that the attribute is important.

b. 14. The definition of a characteristic of a person by specifying the procedures by which it is
measured, is:
a. a functional definition.
b. an operational definition.
c. a procedural definition.
d. a behavioral definition.
e. a measurement definition.

e. 15. An examination of the instruments that have been devised to measure self-concept will reveal
that they differ from one another in content. These differences are probably due to:
a. variations in self-concept across age groups.
b. differences in construct validity of the instruments.
c. the relevance of the operational procedures to self-concept.
d. the evolving nature of the trait of self-concept.
e. lack of agreement about the definition of self-concept.

a. 16. Which of the following represents an operational definition of curiosity?
a. The number of questions a child asks during a week at school.
b. The range of different topics in which a child is interested.
c. The length of time that a child will spend voluntarily on a puzzle.
d. The quality of a child's exploratory behavior.
e. The child's ability to design experiments.

b. 17. The typical procedure for establishing units in educational and psychological measurement
relies on:
a. a direct comparison of one unit amount with another.
b. a definition that states that any one item on a test is equivalent to any other item.
c. a definition of a unit expressed in physical terms.
d. the intuitive judgment of the teacher or clinician.
e. a rank ordering of people on the trait of interest.

d. 18. Which of the following would NOT be considered a problem in the quantification of
behaviors?
a. Difficulties in selecting attributes.
b. Difficulties in identifying the best procedure to elicit traits.
c. Concern about units of measurement.
d. Opposition to the study of covert traits.
e. Difficulty in reaching general agreement about the meaning of a trait.

b. 19. The types of tests that are most suspect when used with minority or other special groups in
society are those that:
a. appraise progress in school.
b. are used to predict later performance.
c. are non-verbal in content.
d. describe the individual at the present time.
e. will be used for personal decisions.

c. 20. The appropriateness of currently used standardized tests of reading for minority groups has
been challenged for all of the following reasons except:
a. the relevance of national norms to minority groups.
b. the content of the passages used as stimulus materials.
c. the differences in objectives in reading for minority and majority groups.
d. differences in test-taking attitudes between minority and majority children.
e. differences in test-taking experience between minority and majority children.

c. 21. A guidance counselor plans to give a set of instruments to all secondary school students who
come in for counseling. Which one of the instruments is most likely to be considered an
undue invasion of privacy?
a. A verbal reasoning test.
b. An instrument designed to appraise leadership abilities.
c. An attitude scale designed to appraise prejudice against minority groups.
d. An interest inventory to appraise liking for various school subjects.
e. An achievement test.

c. 22. An assessment procedure designed to avoid the inappropriate use of norms is:
a. performance testing.
b. benchmark testing.
c. content referenced testing.
d. selection testing.
e. outcome-based testing.

b. 23. Measurement often fails in psychology or education because we are unable to define clearly
the trait that we wish to measure. This would most likely be a problem in attempts to
appraise:
a. scholastic aptitude.
b. good citizenship.
c. reading comprehension.
d. mechanical interest.
e. athletic skill.

d. 24. In which of the following ways does psychological measurement differ from measurement in
the physical sciences?
I. Only psychological measurements involve measurement error.
II. There is less agreement about what we are measuring in psychological
measurement.
III. The precision of psychological measurement is lower.
a. I only
b. I and II only
c. I and III only
d. II and III only
e. In all three ways

a. 25. Richard, a third-grader, is having considerable difficulty with his school work, particularly
reading. You consult the school records and find a notation that Richard's score on a
cognitive ability test was 85. On the basis of this information the most appropriate course of
action would be to
a. seek additional information.
b. recommend putting Richard in a slow reading group.
c. recommend sending Richard back to second grade.
d. recommend to Richard's parents that they hire a tutor for him.
e. seek counseling for Richard.


Chapter 2: Measurement and Numbers

Chapter Outline

A. Questions to ask about test scores.
1. Scale
2. Pattern
3. Location
4. Spread
5. Relative position
6. Association
7. Description versus inference
B. Scales
1. Nominal
2. Ordinal
3. Interval
4. Ratio
C. Preparation of frequency distributions.
1. Raw frequency distributions
2. Grouped frequency distributions
a) How many groups
b) Interval width
D. Graphic representations.
E. Measures of central tendency.
1. The mode.
2. The median.
3. Scores as continuums.
4. The concept of the interval
5. Percentiles
6. The arithmetic mean.
7. Central tendency and the shape of the distribution
F. Measures of variability.
1. Range
2. Semi-interquartile range.
3. Standard deviation.
a) Computing the standard deviation.
b) Sample value and population estimate
G. Interpreting the standard deviation.
1. Normal distribution
2. Percent of cases
H. Interpreting the score of the individual.
I. Measures of relationship.
1. Scatter plot.
2. Information provided by the correlation coefficient.
a) Sign.
b) Magnitude.
3. Correlations in measurement.
a) Determining the consistency and precision of a measurement procedure.
b) Comparing two procedures to determine whether one is a good predictor.
c) Describing the relationship between two variables.
J. Summary
Study Questions and Answers

1. How are descriptive and inferential statistics different?
Descriptive statistics are used to describe the characteristics of a set of scores. Inferential
statistics involve the use of sample statistics to estimate the characteristics of a population.

2. What is the name we give to a table that indicates how many times each score has occurred?
A frequency distribution.

3. How can the frequency distribution be improved in order to make the presentation more concise?
The scores can be grouped into broader categories.

4. How many categories should be used in a grouped frequency distribution?
A practical rule of thumb is to select groups of a size that will result in 15 categories.

5. Explain how to determine the mode, median, and mean.
a. Mode - Select the score that occurs most frequently.
b. Median - This is the point at which there are as many scores above as below. If there are an
odd number of scores and no tied scores at the middle of the distribution, the median is the
middle score. In other cases the computation of this value is complex. See pages 37 and 38
in the textbook for an explanation of how to accomplish this.
c. Mean - Compute the total of all of the scores and divide this value by the number of scores.
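
As a quick illustration, using the five scores from multiple choice item 9 below (2, 2, 3, 4, 9):

   \[ \text{Mode} = 2 \quad \text{(the most frequent score)} \]
   \[ \text{Median} = 3 \quad \text{(the middle of the five ordered scores)} \]
   \[ \text{Mean} = \frac{2 + 2 + 3 + 4 + 9}{5} = \frac{20}{5} = 4 \]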

6. When would you use the median rather than the mean?
If a distribution is skewed the median might be selected since extreme scores do not affect its
value.

7. What is the purpose of measures of variability?
They tell us how clustered together or spread apart a set of scores is.

8. What are three methods of determining variability?
Range, semi-interquartile range, and standard deviation.

9. To what family of statistics does the semi-interquartile range belong?
The same family as the median.

10. What is the standard deviation?
It is the square root of the average of the squared deviations from the mean.

11. What are the two forms of the standard deviation? When would each be used?
a. The sample standard deviation is:

   \[ SD = \sqrt{\frac{\sum (X - M)^2}{N}} \]

b. The estimate of the population SD is:

   \[ SD = \sqrt{\frac{\sum (X - M)^2}{N - 1}} \]

We use the sample value when we wish to describe the scores we have and are not concerned
with a population. We use the population estimate when we wish to get a best guess about the
spread of scores in a population based on data on a sample from the population.
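
As a quick illustration, using the scores 2, 2, 3, 4, 9 (M = 4):

   \[ \sum (X - M)^2 = (-2)^2 + (-2)^2 + (-1)^2 + 0^2 + 5^2 = 34 \]
   \[ SD_{\text{sample}} = \sqrt{34/5} \approx 2.61 \qquad SD_{\text{population estimate}} = \sqrt{34/4} \approx 2.92 \]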

12. What is the relationship between the standard deviation and the variance?
The standard deviation is the square root of the variance.

13. Under what circumstances would the variance be used?
It is used in more advanced statistical procedures.

14. Why do we square the deviations and then turn around and find the square root of the sum of the
squares?
So that the deviations around the mean will all have positive values (so they wont sum to zero)
which permits further statistical treatment.

15. What is the relationship between the normal curve and the standard deviation?
If a set of scores has a normal distribution there will be a precise mathematical relationship
between the standard deviation and the number of cases.
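
As a concrete reminder of that relationship, roughly 68 percent of the cases in a normal distribution fall within one standard deviation of the mean and roughly 95 percent fall within two; equivalently,

   \[ P(Z \le +1) \approx .84 \]

which is why a score one standard deviation above the mean falls near the 84th percentile (see multiple choice item 22 below).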

16. What is the purpose of scatter plots?
To provide a graphic representation of the relationship between two variables.

17. What is the possible range of the correlation coefficient?
Between -1.0 and +1.0

18. What two types of information does the correlation coefficient provide?
a. The sign of the correlation tells us whether two variables rank people in the same direction or
in different directions.
b. The magnitude of the correlation tells us the strength of the relationship.

19. What are the three important settings in which correlation coefficients will be encountered in
connection with testing and measurement?
a. To determine the consistency and precision of a measurement instrument (reliability).
b. When we are interested in the degree of relationship between two variables and when we
want to evaluate the usefulness of one as a predictor of the other (validity).
c. In order to quantify the relationship between variables that in turn can help us understand
how behavior is organized.

Important Terms

The following are important terms mentioned in Chapter 2:

a. Correlation j. Mode
b. Cumulative frequency k. Normal curve
c. Deviation l. Percentile
d. Descriptive statistics m. Range
e. Frequency distribution n. Score interval
f. Histogram o. Semi-interquartile range
g. Inferential statistics p. Standard deviation
h. Mean q. Variance
i. Median
Match the description below with one of the terms listed above.

____1. The score below which a given percentage of the group falls.

____2. A symmetrical curve having a bell like shape.

____3. One-half of the difference between the first and third quartiles.

____4. The point in a distribution of scores at which there are as many scores above as there are
below.

____5. The range of scores grouped together when constructing a frequency distribution.

____6. The difference between the mean and a score, obtained by subtracting the mean from a score.

____7. The most often used method of determining the average.

____8. The typical score.

____9. A graphical representation of the distribution of scores in which the vertical height of a
column represents the number of scores in each interval.

____10. An index that indicates the relationship between two variables.

____11.How much each score typically deviates around the mean.

____12.The total number of individuals having a score equal to or less than the highest score in an
interval.

____13.Statistics used to describe a set of scores.

____14.Difference between the highest and lowest score.

____15.The average of the squared deviations.

____16.Use of samples to understand populations.

____17.Score that occurs most frequently.

____18. A table that shows the frequency of each score.


Answers to Important Terms

1. l 6. c 11. p 16. g
2. k 7. h 12. b 17. j
3. o 8. j 13. d 18. e
4. i 9. f 14. m
5. n 10. a 15. q


Multiple Choice Questions
b. 1. What do we call statistics that use samples to provide information about a population?
a. descriptive
b. inferential
c. non-parametric
d. theoretical
e. population based

c. 2. You have cognitive ability test scores for 350 children and are preparing to make a frequency
distribution. The scores range from 62 to 134. Which would be the most satisfactory way to group
the scores?
a. 62-63, 64-65, 66-67, etc.
b. 61-63, 64-66, 67-69, etc.
c. 62-67, 68-72, 73-77, etc.
d. 60-65, 65-70, 70-75, etc.
e. 60-69, 70-79, 80-89, etc.

a. 3. In preparing a histogram score intervals are shown along the:
a. abscissa.
b. ordinate.
c. Y axis.
d. tangential plane.
e. polygon function.

a. 4. What is the mode of the following scores: 2,2,3,4,6,7:
a. 2
b. 3
c. 5
d. 6
e. 10

c. 5. According to your text, a score of 25 should be thought of as meaning:
a. more than 24, but not more than 25.
b. from 25 to just not quite 26.
c. from 24.5 to 25.5.
d. exactly 25.

c. 6. What is the median of the following scores: 25,12,29,12,57?
a. 5.
b. 12.
c. 25.
d. 29.
e. 57.

c. 7. The 50th percentile is always the same as the:
a. mode.
b. mean.
c. median.
d. standard deviation.
e. interquartile range.

b. 8. The mean and the median will be identical for what kind of distributions:
a. leptokurtic distributions.
b. symmetrical distributions.
c. bimodal distributions.
d. skewed distributions.
e. all distributions.

c. 9. What is the mean of the following scores: 2,2,3,4,9?
a. 2
b. 3
c. 4
d. 5

a. 10. In most cases, the mean is a better measure of central tendency than the mode or median because:
a. all scores are used in the computation of the mean.
b. the most frequently occurring score is given more weight.
c. it is the score below which 50 percent of the scores are located.
d. only typical scores are used in its computation.
e. it is not an algebraic function of the scores.

d. 11. In the case of a distribution of scores containing a few scores considerably above or below the rest,
the best method of obtaining a measure of central tendency not affected by these scores would be to
use the:
a. mean.
b. variance.
c. mode.
d. median.
e. percentiles.

d. 12. Which of the following is NOT a measure of variability?
a. range
b. semi-interquartile range
c. variance
d. standard variation
e. standard deviation

c.13. The standard deviation is based on:
a. the difference between the highest and lowest scores.
b. the deviation of each individuals performance from the lowest score.
c. the deviation of each score from the group mean.
d. the range of the middle 50% of scores.
e. the number of scores above the mean.

c.14. What is the relative size of the semi-interquartile range (Q) and the standard deviation (S.D.)?
a. They are essentially equal.
b. Q is uniformly larger.
c. S.D. is uniformly larger.
d. Sometimes one is larger, and sometimes the other.
e. It is impossible to tell without computing them.

b.15. The standard deviation is to the semi-interquartile range as the mean is to the:
a. range.
b. median.
c. mode.
d. percentile rank.
e. variance.

e.16. Which of the following measures will be most affected by two or three extreme scores?
a. Interquartile range.
b. Median.
c. Quartile deviation.
d. Mode.
e. Mean.

a.17. In high school, a teacher gave two sections of a class the same arithmetic test. The results were as
follows:
Section I: Mean 45, Standard deviation 6.5.
Section II: Mean 45, Standard deviation 3.1.
Which of the following conclusions is correct?
a. Section I is more variable than Section II.
b. Section II is more variable than Section I.
c. Both sections are equally variable.
d. Section II has brighter students than Section I.

e.18. A student who obtains a score of 65 in a group where the mean is 74 and the standard deviation is 6
would be:
a. one standard deviation below the mean.
b. two standard deviations above the mean.
c. two standard deviations below the mean.
d. one-and-a-half standard deviations above the mean.
e. one-and-a-half standard deviations below the mean.

c.19. To say that a student fell one-half standard deviation above the group mean on a test in which the
scores had a normal distribution would be about the same as saying that the student fell at the:
a. 95th percentile.
b. 85th percentile.
c. 70th percentile.
d. 30th percentile.
e. 15th percentile.

e.20. The standard deviation of scores on a certain test is 8. The variance would therefore be:
a. 2
b. 8
c. 16
d. 24
e. 64

c.21. Suppose in a history course you took two tests. On Test I, which had a mean of 35 and a standard
deviation of 3, you obtained a score of 38. On Test II, which had a mean of 60 and a standard
deviation of 15, you obtained a score of 75. On which test did you do better?
a. Test I.
b. Test II.
c. Same on both.
d. The two tests can not be compared.
e. The answer cannot be determined without the frequency distributions.

c.22. In a normal distribution if you are one standard deviation above the mean, what is your approximate
percentile rank?
a. 50
b. 75
c. 84
d. 95
e. 99

d.23. The number 3.12 could not be a:
a. mean.
b. standard deviation.
c. median.
d. correlation coefficient.
e. variance.

e.24. An individual reported a correlation of 1.25 between form A and form B of an intelligence test.
From this coefficient one would conclude that:
a. the test is unusually reliable.
b. the test would be a good predictor of school achievement.
c. there are no errors of measurement.
d. a person scoring above the mean on one form of the test will almost surely score above the
mean on the other form.
e. a mistake has been made in computing the correlation coefficient.

d.25. A research worker gave a scholastic aptitude test to a sample of eighth graders. Then she correlated
the aptitude test scores with the chronological ages of the subjects. She found a correlation of -.42.
How should this result be interpreted?
a. She had obviously made a computational error.
b. Her sample was composed of dull pupils.
c. The relationship between age and intelligence decreases as people reach the age of 14.
d. The older members of the grade tended to be dull pupils, and vice versa.
e. None of these interpretations is justified.

c.26. A personality test with four different scales was correlated with success in a job situation, with
results as shown below. Which scale would permit the most accurate prediction of job success?
a. Ascendance r=+.35
b. Introversion r=-.20
c. Neurotic tendency r=-.50
d. Self-Sufficiency r=+.40

c.27. For making predictions, a test that yields a large negative correlation with a criterion could be
characterized by which of the following:
a. worse than useless.
b. no better than one with a zero correlation.
c. as useful as one with the same sized positive correlation.
d. preferable to any other.
e. any of the above, depending on the test's empirical validity.

a.28. Which of the following correlations would indicate that two tests were measuring unrelated skills?
a. .00
b. -.23
c. .50
d. .85
e. 1.00

b.29. The correlation coefficient is obtained between academic aptitude test scores and academic
achievement: (1) among students in general and (2) among honor students. Other things being
equal, which statement is most likely to be true?
a. The two coefficients will be the same.
b. The first will be higher.
c. The second will be higher.
d. One will be negative, the other positive.
e. They will be unequal, but we have no basis for knowing which will be higher.
Chapter 3: Giving Meaning to Scores

Chapter Outline

A. The nature of a score.
B. Frames of reference.
1. Temporal dimension.
2. Contrast between what a person can do and what they would
like to do.
a) Maximum performance
b) Typical performance
3. Nature of the standard used to make comparisons.
C. Purposes of measurement.
1. Formative evaluation.
2. Summative evaluation.
3. Criterion-referenced tests
4. Norm-referenced tests
5. Achievement versus aptitude
D. Domains in criterion- and norm-referenced tests.
1. Mastery decisions
2. Relative achievement
E. Criterion-referenced evaluation.
F. Norm-referenced evaluation
1. Properties desirable for scores to have
a) Uniform meaning
b) Uniform units
c) Rational zero
d) Setting normative standards
2. Grade norms.
a) Interpretation.
b) Equality of intervals.
c) Independence from content.
d) Comparability from one subject to another.
3. Developmental standard scores
4. Age norms.
a) Interpretation
b) Shortcomings
5. Percentile norms.
a) Comparison groups.
b) Percentile ranks different from percentiles
c) Computation
d) Need for appropriate reference groups
e) Inequality of units.
6. Standard scores.
a) Computation.
b) Comparing relative positions.
7. Converted standard scores.
a) C-scores.
b) Alternative conversions.
8. Normalizing transformations.
a) Area transformations
b) Normal curve equivalents
c) Stanines.
9. Comparison of scales
G. Interchangeability of different types of norms.
1. Moving from one scale to another
2. Restrictions
H. Quotients
1. Rationale for the IQ
2. Limitations
I. Profiles
1. Ways to present profiles
2. Cautions in interpreting profiles
J. Criterion-referenced reports.
1. Reports by topic
2. Item analysis reports
K. Norms for school averages.
L. Cautions in using norms.
1. Normative scores give relative rather than absolute information.
2. Output must be evaluated in relation to input.
3. Output must be evaluated in relation to objectives.
M. Item response theory (IRT)
1. Scale of task difficulty - the b parameter
2. Scale of examinee ability - the θ (theta) parameter
3. Item characteristic curve (trace line)
a) Difficulty
b) Discrimination - the a parameter
c) Guessing - the c parameter
4. Computer adaptive testing (CAT)
N. Summary.


Study Questions and Answers

1. What is the most important factor in interpreting a test score?
The context or frame of reference of the measure.

2. What are the three dimensions of the frame of reference of measurement?
a. The temporal dimension.
b. The contrast between what a person can do (maximum performance) and what they would like to do
(typical performance).
c. The nature of the standard against which we compare people.

3. Explain the difference between a norm-referenced and a criterion-referenced test.
Criterion-referenced tests describe performance in terms of the mastery of specific skills, while norm-
referenced tests describe overall performance in reference to a sample of similar students.

4. What are the three characteristics of an ideal scale?
a. A uniform meaning from test to test.
b. Units of uniform size.
c. A true zero point.

5. What is a grade norm?
The average score obtained by individuals at a particular grade.

6. How are grade equivalents computed?
They are determined mathematically through interpolation and extrapolation.

7. With what type of subject matter are grade equivalent norms most appropriately used?
They work best for subjects taught throughout the school years like reading or math. They are least
useful for subjects taught for only one year or subjects that are first introduced at the secondary level such
as biology and foreign languages.

8. How should a grade equivalent of 7.3 in math obtained by a child in fifth grade be interpreted?
You can say that the child can do fifth grade math as well as a seventh grader can do fifth grade math.

9. How can performance in different subjects for the same child be compared using grade norms?
With great caution. In general, being a year behind or ahead in one subject will not have the same
meaning for other subjects. For instance, being a year ahead in math represents a greater accomplishment
than being a year ahead in reading because fewer students accomplish this feat.

10. Under what circumstances should age norms be used?
For the elementary school years for characteristics that increase as a function of general development
such as height and weight. General cognitive development also can be legitimately described in this way.

11. What is the difference between percentile ranks (norms) and percentiles?
A percentile rank is the proportion of individuals below a certain point on a scale. It is computed only for
obtainable scores. The percentile is the point on the continuous score scale below which a given percent
of individuals fall. You go from the score to the percentile rank and from the percentile to the score (or
point on the score continuum).
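
As a quick illustration consistent with multiple choice item 14 below: if 40 of the 50 pupils in a class score below a given point on the score scale, the percentile rank of that score is

   \[ PR = \frac{40}{50} \times 100 = 80 \]

and, going the other direction, that score point is the 80th percentile for the class.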

12. What is the first and most important issue that we face when using percentile norms?
The selection of the group with which to make the comparison.
13. What is the most important characteristic of percentile norms?
They are never normally distributed. The difference between a given number of percentile points is not
the same along the scale (there is a bigger difference between the 5th and 10th percentiles than between
the 50th and 55th percentiles). It is therefore inappropriate to add, subtract, multiply, or divide these
values.

14. How are z-scores computed?
The group mean is subtracted from the score and the resulting value is divided by the standard deviation.

   \[ Z = \frac{X - M}{SD} \]

15. Why is there a need for standard scores other than the Z- score?
Because Z-scores require plus and minus signs and can be decimals.

16. How are converted standard scores (C-scores) computed?
By multiplying the Z-score by a standard deviation of the user's choosing and adding a mean of the user's
choosing.
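
As a quick illustration consistent with multiple choice item 20 below: a raw score of 44 in a group with a mean of 38 and a standard deviation of 4 gives

   \[ Z = \frac{44 - 38}{4} = 1.5 \]

and converting to the familiar T-score scale (a chosen mean of 50 and standard deviation of 10) gives

   \[ T = 10Z + 50 = 10(1.5) + 50 = 65 \]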

17. How are scores normalized?
By changing the raw scores to percentile ranks and then to the Z-score equivalent to that percentile rank
in the normal distribution. This value can then be converted to a more convenient format such as a scale
with a mean of 50 and a standard deviation of 10.

18. What is the range of values that an NCE score can have?
An NCE score can range from 1 to 99 just like a percentile rank; however, the intervals between values
have been made equal.

19. When can normalized scores be used?
When there is no systematic curtailment of the scale at either end of the distribution and when the
underlying construct being measured can reasonably be thought of as normally distributed.

20. What is the most useful characteristic of stanines?
They measure performance in broad categories, which de-emphasizes small differences. They are used to
communicate test results to the lay public, who may not understand that small differences are unimportant.

21. How were intelligence quotients (IQ score) first computed? How are they presently computed?
Originally, mental age was divided by chronological age and multiplied by 100. Modern tests that report
such a score (or the same score with a different name) compute it as a normalized standard score
generally with a mean of 100 and a standard deviation of 15.
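
As a quick illustration of the original ratio definition: a child with a mental age of 10 and a chronological age of 8 would have received

   \[ IQ = \frac{MA}{CA} \times 100 = \frac{10}{8} \times 100 = 125 \]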

22. How does the interpretation of percentile norms differ when we compare individual performance with
class performance?
The variation from pupil to pupil is much greater than from school to school.

23. What is a limitation of normative scores?
They tell us how a student performs in relation to other students but not what they have learned.

24. How are ability level and item difficulty defined in IRT?
A person's ability level is defined as the point on the ability/difficulty continuum where they have a 50
percent chance of getting an item correct. Item difficulty is defined as the point on the ability/difficulty
continuum where 50 percent of examinees would get the item correct.

25. What are the four parameters that can be used in IRT?
1. Item difficulty - b
2. Item discrimination - a
3. Guessing - c
4. Person ability - theta
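
As a sketch of how these parameters fit together, the commonly used three-parameter logistic form of the item characteristic curve expresses the probability that an examinee of ability theta answers an item correctly as

   \[ P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}} \]

When c = 0 and theta = b, P(theta) = .50, which matches the definitions of ability and difficulty given in question 24.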


Important Terms

a. Achievement test l. Percentile norms.
b. Age norms m. Profiles
c. Aptitude test n. Quotients
d. Converted standard scores o. Standard scores
e. Criterion-referenced tests p. Stanines
f. Formative evaluation q. Summative evaluation
g. Grade norms r. z-scores
h. Linear transformations s. a parameter
i. Normal Curve Equivalent scores t. b parameter
j. Normalized T-scores u. c parameter
k. Norm-referenced evaluation v. theta

Match the description below with one of the terms listed above.

____ 1. The type of evaluation that takes place at the conclusion of a marking period.

____ 2. Testing intended to guide future instruction.

____ 3. A test that predicts future performance.

____ 4. A test that measures what a student has learned.

____ 5. A test that focuses on the mastery of learning tasks.

____ 6. A norm based on the percentage of students that an individual has exceeded.

____ 7. A norm based on the average performance of students at each grade.

____ 8. A norm based on the average performance of students at different ages.

____ 9. A standard score with a mean of 50 and a standard deviation of 10.

____10. A set of different test scores for an individual expressed with a common unit of measurement.

____11. A normalized score that can range from 1 to 99.

____12. A scale with a mean of 50 and a standard deviation of 10 which has been adjusted in such a
way that the scores are normally distributed.

____13. A method of evaluating test results by comparing a students scores with those of other
students.

____14. A standard score that uses just nine integers.

____15. Scores that are based on means and standard deviations.

____16. Standard scores with a mean of 0 and a standard deviation of 1.

____17. Scores based on ratios.

____18. Transformation from one scale to the other in which the relative position of scores stays the
same.

____19. Item discrimination parameter.

____20. Item difficulty parameter.

____21. Guessing parameter.

____22. Person ability parameter.


Answers to Important Terms

1. q 6. l 11. i 16. r 21. u
2. f 7. g 12. j 17. n 22. v
3. c 8. b 13. k 18. h
4. a 9. d 14. p 19. s
5. e 10. m 15. o 20. t


Multiple Choice Items

e. 1. Gordon received a score of 25 on a test of 100 questions. What interpretation are we justified in
making about this score?
a. The score falls below the median.
b. The score lies at the 25th percentile.
c. The score is one of the lowest in the group.
d. The score is probably below what we should expect for Gordon.
e. No interpretations are possible.

c. 2. When a teacher makes a decision about what grade to assign to a student, he or she is engaged in:
a. a process evaluation.
b. a formative evaluation.
c. a summative evaluation.
d. a qualitative evaluation.
e. a value judgment.

b. 3. What is the appropriate frame of reference for a school system that wishes to compare the
performance of its students with those in other similar schools?
a. domain-referenced
b. norm-referenced
c. criterion-referenced
d. curriculum-referenced
e. self-referenced

d. 4. What performance by a student would indicate that a teacher is making a relative mastery judgment
about a student?
a. A score at the 95th percentile on a nationally normed math test.
b. Demonstrated knowledge of all multiplication facts.
c. A fifth place rank in a math class.
d. Successful meeting of the class requirement that students obtain a score of 90 percent
correct on a test of multiplication facts.


a. 5. When would a mastery frame of reference be most appropriate?
a. In a sequential subject such as mathematics.
b. In making decisions about grades.
c. For placement decisions.
d. To make between school comparisons.
e. For deciding whether to promote a high school junior.

c. 6. Which of the following scores appearing in a student's record would be most meaningful without
further reference to the group?
a. 23 items correct in an English test of 40 items.
b. 30 items wrong in an algebra test of 50 items.
c. 100 words per minute on a test of keyboarding.
d. Omitted 10 items in each of the English and algebra tests.
e. None of these can be interpreted without a reference group.

a. 7. An achievement test in reading for the junior high school was standardized by administering it to a
representative sample of seventh, eighth, and ninth graders. A student received a grade equivalent
score on the test of 6.4. This grade equivalent was probably derived using:
a. extrapolation.
b. interpolation.
c. standard deviation equivalents.
d. heterogeneous grade norms.
e. correlating the test scores with grades.

a. 8. For which of the following would grade norms be suitable?
a. A reading test for the fifth grade.
b. A chemistry test for the eleventh grade.
c. A mechanical aptitude test for selecting machinist apprentices.
d. A college aptitude test.
e. A test to determine a person's grade in the army.

c. 9. A grade equivalent of 6.0 in math for a 4th grader means the student can do:
a. 6th grade math.
b. some 6th grade but mostly 4th grade math.
c. 4th grade math as well as a 6th grade child can do 4th grade math.
d. no 4th grade math.
e. as well as 6 percent of 4th graders.

a. 10. Why is it not sensible to expect the typical school system to bring all fourth graders up to the
fourth-grade norm on a standardized achievement test in such a subject as reading? Because the
norm:
a. is an average, rather than a minimum standard.
b. is a minimum standard, rather than an average.
c. is designed for average and above average children.
d. takes no account of limited social and cultural background.
e. has been moved up in recent years.

d. 11. For which of the following tests would one be most likely to use age norms?
a. A test of achievement in chemistry.
b. A questionnaire on personal adjustment.
c. An aptitude test for selecting airplane pilots.
d. Test of motor skills for elementary-school children.
e. A measure of height for military recruits.

c. 12. For guidance work with senior high-school students, one would generally find it most convenient
to work with what kind of norms?
a. age norms.
b. grade norms.
c. percentile norms.
d. quotient norms.
e. local norms.

e. 13. Robert is told that his score on a standardized reading test places him at the 15th percentile for his
norm group. He might justly infer that:
a. he answered 15 percent of the items correctly.
b. his performance was exceeded by 15 percent of the norm group.
c. 15 percent of the group received the same score that he did on the test.
d. he is performing at his expected level.
e. he did better than 15 percent of the norm group.

b. 14. Mary fell at the 80th percentile on the 100-item final examination in history given to her class of 50
pupils at the school. This means that she:
a. got 80 items right.
b. was better than 40 pupils in the class.
c. was exceeded by 40 pupils in the class.
d. was exceeded by 80 percent of the class.
e. did better than should have been expected.

a. 15. On the Wechsler Intelligence Scale for Children, John fell at the 98th percentile on the V (verbal)
score and the 90th percentile on the P (performance) score. Henry fell at the 55th and 45th
percentiles on the same two scores. Who showed the greater unevenness in performance?
a. John
b. Henry
c. There was no difference.
d. The data provide no basis for judging.

a. 16. If percentile norms are to provide a meaningful picture of an individual's performance, they must
be:
a. based on a group of which the individual is a member.
b. expressed in equal units of score.
c. translated into quotient values.
d. revised every year or two.
e. provided to the nearest whole percent.

c. 17. What advantage do standard scores have over percentiles?
a. They are easily understood.
b. They can be used for any age group.
c. They represent equal units of ability.
d. They are more easily computed.
e. None, standard scores have no advantage over percentiles.

b. 18. A distribution of Z-scores has a mean of ___________ and a standard deviation of ___________.
a. 1 and 0.
b. 0 and 1.
c. 5 and 2.
d. 50 and 10.
e. 100 and 15.

b. 19. A Z score of -.10 indicates:
a. the raw score is far below the mean.
b. the raw score is slightly below the mean.
c. there is an inverse relationship.
d. the distribution is skewed.
e. a mistake has been made in computation.

b. 20. The mean of a test is 38. You get a 44 and learn that it is equivalent to a T score of 65. What is the
standard deviation of the test?
a. 2
b. 4
c. 6
d. 10
e. 15
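
(Worked check, added for convenience using the item's own numbers: since $T = 50 + 10\frac{X - M}{SD}$, we have $65 = 50 + 10\frac{44 - 38}{SD}$, so $SD = 4$.)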

b. 21. Which of the following has a different meaning for a percentile than for an NCE?
a. 1
b. 25
c. 50
d. 99
e. They are all equivalent because NCEs are based on percentiles.

d. 22. What type of norms do modern intelligence quotients most basically resemble?
a. Age norms.
b. Grade norms.
c. Percentile norms.
d. Standard score norms.
e. Ratio norms.

e. 23. Which type of norm comes closest to providing really equal units for measurement in psychology
and education?
a. Quotient norms.
b. Age norms.
c. Grade norms.
d. Percentile norms.
e. Standard score norms.

a. 24. The navy reports aptitude test results in terms of standard scores with a mean of 50 and a standard
deviation of 10. A recruit with a mechanical comprehension score of 65 is a candidate for
machinist training. On the basis of this score which of the following judgments would be most
justified?
a. She is a very promising candidate.
b. She is likely to prove about average.
c. She is a borderline case.
d. She is definitely a poor risk.
e. None of the above because mechanical comprehension does not measure the duties of a
machinist.

b. 25. A twelfth-grade boy got a standard score of 500 on the Scholastic Assessment Test (SAT). From
this we could conclude that he:
a. fell at the average of high school seniors.
b. fell at the average of students taking the SAT.
c. has an average chance of success in college work.
d. should be encouraged to attend a college of only average difficulty.
e. should retake the test in an attempt to raise his score.

d. 26. Test makers frequently assign standard score equivalents in such a way as to normalize the
resulting distribution of scores. They do this because they believe:
a. the target group is homogeneous.
b. the trait measured depends on many different genes.
c. item difficulty is normally distributed.
d. the original raw-score units were not really equal.
e. people inherently form a normal distribution.

e. 27. The primary advantage claimed for stanines over T-scores is that the stanines:
a. provide a norm with equal units.
b. are comparable from test to test.
c. involve no statistical assumptions.
d. are easier to understand.
e. reduce the tendency to over-interpret small score differences.

c. 28. Some current tests report results as percentile bands that include a range of percentiles. An
advantage claimed for this procedure is that it:
a. expresses results in equal units.
b. presents results in a form that is easy for the layman to interpret.
c. reduces the tendency to over-interpret small differences.
d. keeps statistical assumptions to a minimum.
e. emphasizes the precise nature of test scores.

c. 29. If Jason's theta score exactly equals the b-parameter of an item, what is the probability that he
will get the item correct?
a. less than chance
b. less than 50%
c. just about 50%
d. somewhat over 50%
e. he is just about certain to get the item correct

a. 30. Which item is more discriminating?
a. one with an a-parameter of .3
b. one with a b-parameter of .8
c. one with a c-parameter of zero
d. one for which a = theta
e. one for which b = theta

b. 31. If you get an item correct on a computer adaptive test, what is most likely to happen?
a. the next item will be easier
b. the next item will be more difficult
c. the next item will be of the same difficulty as the previous one.
d. the estimate of your theta score will be reduced.

Chapter 4: Qualities Desired in Any Measurement Procedure:
Reliability

Chapter Outline

A. Introduction.
B. Reliability as consistency.
1. True and error scores
a) True score is constant
b) Random errors reveal lack of reliability
2. Sources of inconsistency.
a) The person may have actually changed.
b) The task may have been different
c) The limited sample of behavior.
C. Two ways to express reliability.
1. Standard error of measurement
2. Reliability coefficient
3. Relationship between SD_e and r_tt

D. Ways to assess reliability.
1. Retest with the same test.
a) Variation in the individual from time to time.
b) Variation due to the sample of tasks.
2. Parallel test forms.
a) Definition of alternate forms.
b) With and without time interval.
c) Practical limitations.
3. Single administration methods.
a) Subdivided tests.
1) Requirement of equivalent halves
2) Methods of dividing a test.
3) Spearman-Brown Prophecy formula.
b) Internal consistency reliability.
1) Theoretical rationale.
2) Coefficient alpha.
3) KR-20
4) KR-21
5) Independence from changes over time.
6) Items not independent
7) Items must be homogeneous
8) Inappropriateness when a test is speeded.
4. Comparison of methods.
a) Need for more than one form of reliability data.
b) Sources of variation for each method.
E. Interpretation of reliability data.
1. Relationship between reliability and the standard error of measurement.
2. Relating standard error of measurement to the normal distribution.
a) Error in estimating true score
b) Change on retest
3. Reliability coefficient
a) For comparing different tests
b) A ceiling for validity
F. Factors affecting reliability.
1. Variability of the group.
2. Level of ability in the group.
a) Floor and ceiling effects
b) Effect of the distribution of item difficulties
3. Length of the test.
a) Adding items
b) Multiple raters
c) Shortening a test
4. Operations used for estimating reliability.
5. Practical versus theoretical reliability
G. Minimum level of reliability.
H. Reliability of difference scores.
1. Effect in assessing gains
2. Interpreting profiles
I. Effects of unreliability on correlation between variables.
1. Errors are uncorrelated
2. Correction for attenuation
J. Reliability of criterion-referenced tests.
1. Reliability of mastery decisions
2. Reliability of domain scores
K. Reliability of computer adaptive tests
1. Item information
2. The item information function
3. Standard error of measurement from test information
L. Summary


Study Questions and Answers

1. How are reliability and validity related?
Validity is the essential quality for a test to have. Reliability is a necessary precondition for validity.
Reliability sets an upper limit to validity.

2. What elements make up a person's score?
A person's score is made up of his or her true score plus error of measurement.

3. What are the three sources of inconsistency between measurements?
a. The person may actually have changed.
b. The task on the two measurements may be different.
c. The sample of behavior is limited.

4. What are the three ways to determine reliability?
a. Repeating the same test or measure (retest).
b. Administering a second "equivalent" form of the test.
c. Subdividing the test into two or more equivalent parts derived from a single administration.

5. List three sources of variation in performance that tend to reduce the stability of a score.
a. Variation from trial to trial in responding to the task at a particular moment in time.
b. Variation in the individual from one time to another.
c. Variation arising out of the particular sample of tasks chosen to represent a domain of
behavior.

6. What method of determining reliability measures all three of the sources of inconsistency?
Alternate forms with a time interval between measurements.

7. Why are single administration methods used more often than alternate forms?
It is much more practical to determine reliability from a single administration.

8. What is the most general form of single administration method?
Coefficient alpha.

9. What information is needed in order to compute coefficient alpha?
The number of items, the standard deviation of the scores, and the standard deviation of each item.
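For reference, coefficient alpha combines exactly these quantities (standard notation, not quoted from the text): $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i SD_i^2}{SD_X^2}\right)$, where $k$ is the number of items, $SD_i$ the standard deviation of item $i$, and $SD_X$ the standard deviation of total scores.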

10. What value is substituted for the standard deviation of each item in the KR-20 formula for computing the
reliability of dichotomously scored test items?
The proportion of people getting each item correct (p) multiplied by (1-p).
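In standard notation (not quoted from the text), KR-20 is coefficient alpha with $p_i(1 - p_i)$ substituted for each item variance: $KR\text{-}20 = \frac{k}{k-1}\left(1 - \frac{\sum_i p_i(1 - p_i)}{SD_X^2}\right)$.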

11. List three limitations of coefficient alpha and KR-20.
a. They do not measure changes in scores that occur over time.
b. They will overestimate reliability if there is a common core of knowledge needed to answer
groups of items.
c. They will overestimate the reliability of speeded tests.

12. What method of dividing a test into two parts for purposes of determining reliability is most often used?
The odd-numbered items are correlated with the even-numbered items.

13. What must be done to the correlation obtained between two halves of a test to make it comparable to
alternate forms reliability?
The following formula is applied to the results to correct for the effect of reduced test length:

$r_{tt} = \frac{2\,r_{AB}}{1 + r_{AB}}$
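An illustrative application (numbers chosen for illustration): if the correlation between the two halves is $r_{AB} = .40$, the estimated full-length reliability is $r_{tt} = \frac{2(.40)}{1 + .40} \approx .57$ (consistent with multiple-choice item 16 in this chapter).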



14. What is the standard error of measurement and how is it computed?
It is the standard deviation that would be obtained from a series of measurements of the same person,
assuming that no change occurs in the person as a result of measurement.

$SD_e = SD_X \sqrt{1 - r_{tt}}$
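An illustrative calculation (values chosen for illustration): with $SD_X = 15$ and $r_{tt} = .91$, $SD_e = 15\sqrt{1 - .91} = 4.5$.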

15. What four factors affect reliability?
a. Range of the group.
b. Level of ability in the group.
c. Length of the test.
d. Operations used for estimating reliability.

16. Which method of estimating reliability can be expected to yield the lowest value?
Alternate forms reliability with an interval between test administrations.

17. Why is the reliability of the difference between two scores lower than the reliability of each separate
score?
The error of measurement in each separate score accumulates in the difference score and whatever is
common between the two tests is canceled in the difference score.

18. How do criterion-referenced and norm-referenced tests differ in terms of item difficulty?
Items on criterion-referenced tests tend to be easier (have a higher percentage of persons getting the items
correct).

19. List the three ways that criterion-referenced tests can be interpreted.
a. In terms of mastery or non-mastery of the relevant skill.
b. As an indication of relative mastery.
c. Scores from such a test estimate a person's domain score.

20. What are two methods of estimating the reliability of criterion-referenced tests?
a. Decision consistency between two alternative forms.
b. Through the analysis of variance components.

21. What is item information?
It is the rate at which the probability of getting an item correct is changing at a particular ability level.
Item information is greatest for people whose ability (theta) is at the difficulty level (b) of the item.
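Under a two-parameter logistic model (an assumption; the text's parameterization may differ), this is captured by the item information function $I(\theta) = a^2\,P(\theta)\,[1 - P(\theta)]$, which is largest when $\theta = b$, that is, when $P(\theta) = .5$.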

22. How is reliability estimated for computer adaptive tests?
The test information is calculated from the sum of the item information functions for the items a person
took. The standard error of estimate for this set of items is


$SD_e = \frac{1}{\sqrt{\text{Test information}}}$
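An illustrative calculation (values chosen for illustration): if the item information values sum to a test information of 16, then $SD_e = 1/\sqrt{16} = .25$ on the theta scale.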



Important Terms

a. Coefficient alpha
b. Internal consistency
c. KR-20
d. KR-21
e. Parallel test forms
f. Split-half
g. Spearman-Brown prophecy formula
h. Standard error of measurement
i. True score
j. Item information
k. Test information


Match the description below with one of the terms listed above.

____ 1. The computation of reliability that is based on the correlation between two halves of a test (usually
odd versus even items).

____ 2. Estimate of reliability that can be used with dichotomously scored tests. It is computed using the
proportion passing each item to determine item variability.

____ 3. The most general form of internal consistency reliability.

____ 4. The average value of a hypothetical and infinitely large number of administrations of a test

____ 5. The standard deviation of a hypothetical and infinitely large number of administrations of a
test.

____ 6. Estimate of reliability that can be used with dichotomously scored tests. It is based only on
the mean, standard deviation and number of items.

_____ 7. Reliability based on the correlation between two alternate forms of a test.

____ 8. The correction that is used to compensate for the small sample obtained when two halves of a test
are correlated.

____ 9. Reliability based on the degree to which all items appear to be measuring the same
characteristic.

____10. An index of the rate at which the probability of getting an item correct is changing

____11. Used to determine the standard error of measurement for a computer adaptive test.


Answers to Important Terms

1. f 4. i 7. e 10. j
2. c 5. h 8. g 11. k
3. a 6. d 9. b
Multiple Choice Questions

c. 1. A test designed to measure introversion proved to have high internal consistency, but to be
minimally related to other measures of introversion. The test, therefore, can be considered:
a. valid and reliable.
b. valid but not reliable.
c. reliable but not valid.
d. neither reliable nor valid.

d. 2. Which of the following could not be true for an aptitude or achievement test?
a. Though it has little face validity it has good statistical validity.
b. Though it has high content validity it has low reliability.
c. Though it has low validity it has high reliability.
d. Though it has low reliability it has high validity.
e. Though it has low reliability it has low content validity.

b. 3. Which traits appear most stable?
a. temperament
b. cognitive ability
c. personality
d. emotional state
e. vocational interest

c. 4. The reliability of a test or measuring instrument indicates the:
a. extent to which it measures what it is supposed to measure.
b. objectivity of the scores it yields.
c. freedom of the test scores from errors of measurement.
d. minimum validity of the test.
e. capacity of the test to discriminate among examinees.

d. 5. Test reliability can be determined by:
a. computing the mean and standard deviation.
b. correlating the scores made on a mid-term with a final exam.
c. determining the correlation between a test score and a criterion measure.
d. correlating scores from two forms of the same test.
e. correlating test scores with another measure of the same construct.

c. 6. A company administers a keyboarding test to prospective candidates for a secretarial position. In
taking the five minute test, the page from which Bob is typing slips and he loses his place. His
performance on the test therefore underestimates his ability. This is an example of unreliability
resulting from:
a. the person changing.
b. the task changing.
c. a limited sample of behavior.
d. the generalizability of errors.
e. the stimulus error.

a. 7. The standard error of measurement is best described as the standard deviation of a distribution of:
a. repeated measurements of a single individual.
b. scores for a homogeneous group.
c. scores for a single form of a test.
d. differences between scores on two test administrations.
e. scores for individuals divided by the group standard deviation.

a. 8. The standard error of measurement is most closely related to:
a. reliability.
b. validity.
c. the mean of the test.
d. the shape of the distribution.
e. the person's deviation from the mean.

b. 9. The standard error of measurement is used instead of a reliability coefficient when:
a. the main concern is with validity.
b. we want to know how much confidence to place in a score.
c. the scores are not normally distributed.
d. the sample size is small.
e. there probably is error in our predictions.

b. 10. The accuracy of a specific person's score is best judged from:
a. a split-half coefficient.
b. the standard error of measurement.
c. the difference between that pupil's score on two forms of the test.
d. the correlation of that test with a different test of the same trait.
e. the standard error of estimate.

d. 11. The computation of reliability by comparing scores from the administration of the same test at two
different sittings is called:
a. parallel forms.
b. alternate forms.
c. first test-second test.
d. test-retest.
e. consistency reliability

c. 12. The most likely difficulty in the use of parallel forms reliability is:
a. its tendency to overestimate reliability.
b. a conflict with classic theories of testing.
c. the impracticality of constructing two forms of the same test.
d. that it tells us nothing about validity.
e. that the errors of measurement will not be due to common content.

b. 13. Which of the following would be appropriate for estimating the reliability of a 2-minute test of
clerical speed and accuracy?
a. Coefficient alpha.
b. Form A vs. Form B.
c. Split-half.
d. Kuder-Richardson 20.
e. The Spearman-Brown adjusted reliability.

d. 14. If items with the lowest correlation with the total score are eliminated, a test will then most likely
exhibit:
a. lower validity.
b. greater difficulty.
c. lower difficulty
d. greater internal consistency.
e. lowered reliability.

b. 15. Internal consistency reliability should not be used with:
a. achievement tests.
b. speeded tests.
c. power tests.
d. locally constructed tests.
e. tests of homogeneous constructs.

c. 16. The correlation between the two halves of a spelling test is .40. From this, the best estimate is that
the reliability of the complete test is:
a. certainly above .80.
b. about .80.
c. about .60.
d. also .40.
e. certainly below .40.

b. 17. Jacob's score on an achievement test is 75. His standard error of measurement for the test is
reported to be five points. What are the chances that his true score is between 70 and 80?
a. About 9 chances in 10.
b. About 2 chances in 3.
c. About 1 chance in 2.
d. About 1 chance in 3.
e. About 1 chance in 6.

b. 18. Michael's score on a test is 60. The standard error of measurement of this score is three points.
From this information, one may conclude that the chances are about:
a. 1 in 2 that his true score is included by the range of scores from 57 to 63.
b. 95 out of 100 that his true score is included by the range of scores from 54 to 66.
c. 99 out of 100 that his true score is included by the range of scores from 54 to 66.
d. 50 out of 100 that his true score is either 59, 60, or 61.
e. 50-50 that the error is less than 5 points.
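
(A brief rationale for items 17 and 18, added for convenience: treating errors of measurement as approximately normal, an interval of $\pm 1\,SD_e$ around the obtained score contains the true score about 68% of the time, roughly 2 chances in 3, and $\pm 2\,SD_e$ about 95% of the time.)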

a. 19. The reliability of a reading test for fourth-grade pupils is reported to be .90. From this information
one can determine:
a. the extent to which each pupil will maintain his or her position in the group if an
equivalent test is given.
b. how many points pupils are likely to change, on the average, if an equivalent test is given.
c. the degree to which the test is measuring the important aspects of reading.
d. the extent to which the test is related to other significant factors in the individual.

d. 20. Which condition is most necessary for internal consistency reliability?
a. Construct stability.
b. Similarity in forms.
c. Consistency among examinees.
d. Variability among examinees.
e. A small number of heterogeneous items.

c. 21. An article in a psychological journal reported, "The reliability of the personality ratings was .55.
The reliability of the supervisor's ratings on the job was .48. Personality ratings correlated with the
measure of job success to the extent of .65." Why should you automatically suspect that there was
an error in the report?
a. It would be unusual to have the reliability of ratings be this low.
b. A personality rating could not be more reliable than a measure of job success.
c. The correlation between two tests cannot be higher than the reliability of the more reliable test.
d. It would be very rare to obtain a validity coefficient this high.
e. The correlation clearly has not been corrected for attenuation.

b. 22. If you administered a test of English grammar to all of the students in a high school, which group
would produce the highest reliability coefficient?
a. Below average ninth graders.
b. Average tenth graders.
c. Above average eleventh graders.
d. Low achieving students in the school.
e. Students in the school's remedial reading program.

e. 23. Which of the following procedures is most likely to give one a distorted impression of the
reliability of a test?
a. Combining data from several different communities.
b. Basing the reliability on test-retest procedures.
c. Reporting reliabilities for boys and girls separately.
d. Computing the reliability for boys and girls combined.
e. Computing the reliability for a group including several grades.

a. 24. If a test is extremely easy for a given group, we would expect the reliability to be:
a. low.
b. moderate.
c. high.
d. unaffected by this fact.
e. highest for the best students.

a. 25. Including too many easy items on a test has the effect of:
a. shortening the test.
b. making it fairer for all students.
c. increasing variability.
d. making the test a measure of mastery.
e. decreasing the influence of measurement error on differences between examinees.

d. 26. The reliability of a test would most certainly be increased if one increased the:
a. homogeneity of the group tested.
b. variety of types of items on the test.
c. number of persons tested.
d. length of the test.
e. time limit for the test.

c. 27. Which of the following facts is most important to know if we want to compare data on the
reliability of two different achievement tests?
a. The number of items on each test.
b. The number of different types of items.
c. The spread of ability in the two groups on which the reliability coefficient was based.
d. The size of the groups on which each reliability coefficient was based.
e. The correlation between the two tests.

d. 28. A reading test and an intelligence test each have given reliability coefficients of .90. The
correlation between them is .80. A school guidance worker wishes to use the difference between
standard scores on the two tests as a measure of reading retardation. The reliability of this
difference score is approximately:
a. .90
b. .80
c. .70
d. .50
e. .30
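
(Worked check, added for convenience and using the standard difference-score reliability formula, not quoted from the text: $r_{diff} = \frac{\frac{1}{2}(r_{AA} + r_{BB}) - r_{AB}}{1 - r_{AB}} = \frac{.90 - .80}{1 - .80} = .50$.)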

d. 29. When the correlation between two tests, A and B, approaches the reliabilities of the tests, the
reliability of the difference score, (A - B),
a. approaches 1.00.
b. goes up.
c. equals the correlation between A and B.
d. approaches .00.
e. cannot be determined.

b. 30. How does the reliability of the difference between two aptitude test scores typically compare with
the reliability of the scores themselves:
a. Difference scores usually have a higher reliability.
b. Difference scores usually have lower reliability.
c. The reliability of the difference score is usually the same as the lower of the two reliabilities.
d. The reliability of the difference score is usually the same as the higher of the two reliabilities.
e. There is no systematic relationship.

c. 31. How will low reliability in the measurement of two traits affect the size of the correlation that we
will obtain between measures of the traits?
a. The correlation will be increased.
b. There will be no systematic effect on correlation.
c. The correlation will be reduced.
d. It is impossible to predict how the correlation will be affected.

e. 32. Traditional methods of computing reliability do not work well with mastery tests because:
a. mastery tests are too variable.
b. items on such tests are too dissimilar.
c. mastery tests usually are of low quality.
d. there are too few items on such tests.
e. a substantial proportion of examinees may get perfect scores.

d. 33. Which method of determining the reliability of mastery tests is most often used?
a. split halves.
b. KR-20.
c. maximum likelihood.
d. decision consistency.
e. internal consistency.

e. 34. The information that a test item yields about an examinee is directly related to
a. its a-parameter
b. its b-parameter
c. its c-parameter
d. the height of the item characteristic curve above the base line
e. the slope of the trace line at the examinee's ability level

b. 35. If the difference between b and theta is large, we can be sure that
a. the item has high reliability
b. the item has low information
c. the test is unreliable
d. the test is too difficult for this person
e. the item trace line has a slope near zero

b. 36. Which of the following is true of the standard error of measurement of scores from a computer
adaptive test?
a. It will be higher than the standard error of measurement of a traditional test of the same length
b. It is an inverse function of the test information
c. It is independent of test length
d. It can only be computed if the person takes all of the items
e. It cannot be computed because no two people take the same test

Chapter 5: Qualities Desired in Any Measurement Procedure: Validity

Chapter Outline

A. Introduction.
1. Validity is specific to use
2. There is a chain of inference from test to interpretation
3. Types of evidence of validity.
B. Content-related evidence of validity.
1. Specifying contents and processes to be covered.
a) Information content
b) Process content
2. Preparing the test blueprint.
a) Relative emphasis of content areas and process objectives.
b) Type of items to be used.
c) Total number of items for the test.
1) Type of item
2) Age and education of examinees
3) Ability of students
d) Determining item distribution.
e) Appropriate level of difficulty of the items.
1) Determining difficulty
2) Ideal average difficulty
3) Content validity primary concern for achievement tests
f) Measuring a common core of objectives.
g) Need to examine actual test tasks.
3. Content validity for aptitude and typical performance measures
C. Criterion-related evidence of validity.
1. Face validity.
2. Empirical validity.
a) Predictive validity.
b) Concurrent validity.
c) The problem of the criterion.
1) Most criteria are flawed.
2) Actual production is influenced by the efforts of others.
3) Ratings tend to be dependent on the person doing the ratings.
4) All criteria tend to be only partial.
d) Qualities desired in a criterion measure.
1) Relevance.
2) Freedom from bias.
3) Reliability.
4) Availability.
e) Making predictions
1) The regression equation
2) The regression line
3) Finding the regression line
f) Interpretation of validity coefficients.
1) The higher the correlation the better.
2) The amount of new information provided.
3) Dependent on the level of selectivity.
g) Base rates and prediction.
1) How we define success
2) The selection rule we use.
3) The value of a correctly identified success.
4) The "cost" of accepting someone who subsequently fails.
5) The "cost" of missing a candidate who would have succeeded.
h) Standard error of estimate.
1) Even with good measure predictions are approximate.
2) Determining the prediction interval.
3) Effect of reliability on both the predictor and the criterion.
4) Effect of restricted range.
5) Importance of context.
D. Construct-related evidence of validity.
1. Predictions about correlations.
a) Factor analysis
b) Multitrait, multimethod analysis.
1) Reliability diagonals
2) Validity diagonals
3) Convergent validity
4) Discriminant validity
2. Predictions about group differences.
3. Predictions about response to experimental treatments or interventions.
1) Establishment of a network of theory.
2) Emphasis on inferences.
E. The unified view of validity
1. Validation of inferences
a) Interpretive inferences
b) Action inferences
c) Score meaning implies values
2. Validation as a scientific enterprise
a) Validation as hypothesis testing
b) Ruling out competing explanations
3. Construct validation as the whole of validity
a) Score-based inferences
b) Threats to construct validity
1) Construct underrepresentation
2) Construct-irrelevant variance
c) Internal and external components
1) Definition of the construct
2) Relations with other variables
d) Plausible rival hypotheses
4. Messick's unified theory of validity
a) The four-fold model
1) Justification facet
2) Function facet
b) Value implications
1) Meaning of labels
2) Ideology
c) Social consequences of testing
1) Intended
2) Unintended
5. Beyond Messick: Validation as evaluative argument
F. Validity theory and test bias
1. Bias
2. Fairness
G. Overlap of reliability and validity.
1. Repeatability of same test is defined by reliability.
2. Repeatability of prediction involves both reliability and validity.
3. Generalizability theory.
H. Validity of criterion-referenced tests.
1. Emphasis on content validity.
2. Importance of conditions of testing.
3. Use of teacher judgments.
4. Comparison of those who have received instruction with those who have
not.
I. Meta-analysis and validity generalization
J. Summary.


Study Questions and Answers

1. What three types of evidence are used to establish validity?
a. Content-related.
b. Criterion-related.
c. Construct-related.

2. What are two alternate names for content validity?
Rational and logical validity.

3. With what type of test is content-related validity most often used?
With measures of achievement.

4. What is the most important step in the process of establishing content validity?
Examining the actual test items.

5. With what type of test is criterion-related validity most likely to be used?
Aptitude tests.
6. What type of evidence is emphasized with criterion-related validity?
Empirical or statistical evidence.

7. What is face validity and why is it necessary?
Face validity refers to what the test appears to measure. It is important to ensure that test takers will have
confidence that the test measures what they have been told it measures.

8. What is the biggest problem in establishing criterion-related validity?
Finding an appropriate criterion.

9. What are the desired qualities in a criterion measure?
Relevance, freedom from bias, reliability, and availability.

10. What evidence is most often used to establish the relevance of a criterion?
The opinion of judges.

11. What factor determines how high criterion validity must be before it increases the ability to make good
decisions?
The proportion of candidates that can be accepted (selection ratio).

12. What is the standard error of estimate?
The spread of scores around the mean criterion performance of individuals with the same predictor score.
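In standard notation (not quoted from the text), $SD_{est} = SD_Y\sqrt{1 - r_{XY}^2}$; for example, with $SD_Y = 10$ and $r_{XY} = .60$, predictions of the criterion are uncertain by about $SD_{est} = 10\sqrt{1 - .36} = 8$ points.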

13. What is the effect of restricting the range of either the criterion or predictor variable?
It can dramatically lower criterion-related validity.

14. With what type of measurement instruments is construct-related validity most often used?
Personality scales. In many cases it is the only method for validating these instruments, although it is an
appropriate and necessary component in the establishment of the validity of any test.

15. What types of predictions are used in the establishment of construct-related validity?
a. Predictions about correlations.
b. Predictions about group differences.
c. Predictions about responses to experimental treatments or interventions.

16. What is a multitrait, multimethod analysis, and what is it used for?
Two or more different traits are each measured by two or more methods. The purpose is to determine
whether different measures of the same trait correlate more highly with each other than do measures of
different traits obtained using the same measurement method. It provides evidence of construct validity.

17. According to Messick, what is validated in construct validity?
The interpretation of test scores rather than the test itself.

18. How are test validation and the scientific method similar?
Both involve testing hypotheses.

19. What are the two primary threats to construct validity?
a. construct underrepresentation
b. construct-irrelevant variance

20. What is test bias?
Test bias occurs when one group scores higher or lower than another group due to construct
underrepresentation or the presence of construct-irrelevant variance.

21. What type of validity is typically emphasized with criterion-referenced tests?
Content validity.

22. What is validity generalization?
A form of meta-analysis in which evidence for predictive validity of a test is combined across studies and
extended to new, similar situations.


Important Terms

a. Attenuation h. Mastery
b. Base rate i. Rational validity
c. Construct-related validity j. Restriction in range
d. Content-related validity k. Selection ratio
e. Criterion-related validity l. Standard error of estimate
f. Face validity m. Validity generalization
g. Hit rate n. Construct underrepresentation
o. Construct-irrelevant variance


Match the description below with one of the terms listed above.

____ 1. Lowered validity resulting from the unreliability of the criterion.

____ 2. Lowered reliability resulting from low variability in a criterion.

____ 3. Standard deviation of the scores of individuals with the same predictor score.

____ 4. The proportion of a group of applicants that would have succeeded if all were admitted.

____ 5. The proportion of correct decisions plus the proportion of correctly rejected failures.

____ 6. The degree to which a test looks like it is measuring what it is supposed to measure.

____ 7. The degree to which a test can make accurate predictions.

____ 8. Determination that a student has achieved an objective.

____ 9. The extent to which appropriate objectives are represented on a test.

____10. Another name for content validity.

____11. The degree to which establishing the legitimacy of inferences for one purpose can justify the use of
an instrument for other purposes.

____12. The determination of what a score means.

____13. The proportion of individuals that must be accepted.
____14. Important aspects of the construct are not included in the test material or testing situation.

____15. Factors, such as speed of performance, that are not part of the construct definition affect test
performance.

Answers to Important Terms

1. a 4. b 7. e 10. i 13. k
2. j 5. g 8. h 11. m 14. n
3. l 6. f 9. d 12. c 15. o


Multiple Choice Items

d. 1. Which characteristic of a test is most important?
a. stability.
b. consistency.
c. reliability.
d. validity.
e. practicality

e. 2. The way validity is viewed has changed in the last 20 years. It is now viewed as:
a. a unitary characteristic of the test.
b. categorical.
c. a characteristic of the test.
d. an objective rather than a subjective trait.
e. a property of test use rather than the test.

d. 3. Validity focuses on the:
a. test items.
b. construct.
c. domain of observables.
d. uses of the test scores.
e. qualifications and training of test users.

a. 4. Which of the following would NOT be likely to be used in appraising the content validity of a high
school level standardized achievement test in English?
a. Correlations with college marks.
b. Analysis of the content of high-school English textbooks.
c. Examination of recommendations by the Modern Language Association.
d. Pooled judgment of a group of experts.
e. Results from a teachers' focus group.

a. 5. Restricting the items on a standardized achievement test to the course contents and learning
outcomes typically found in the majority of schools is necessary to ensure:
a. content-related validity.
b. test reliability.
c. criterion-related validity.
d. construct related validity.
e. face validity.

b. 6. In evaluating the content validity of a test, it is of primary importance to examine:
a. validity coefficients.
b. the actual items on the test.
c. reliability.
d. a description of the content of the test.
e. research that has been done with the test.

c. 7. With which type of test would you be most interested in predictive validity?
a. Personality tests.
b. Achievement tests.
c. Aptitude tests.
d. Vocational interest tests.
e. A test measuring attitude toward methadone treatment.

a. 8. For which of the following tests would one be most exclusively interested in predictive validity?
a. A biographical data blank used for selecting airline pilots.
b. A measure of attitudes towards Communism used in a political science class.
c. A diagnostic test of reading comprehension used with fourth graders
d. An introversion-extroversion questionnaire used with personality research.
e. A mathematics achievement test used in high school geometry.

d. 9. Which of the following is usually the biggest problem in establishing the predictive validity of a
test?
a. Devising a test with good items.
b. Administering the test under uniform conditions.
c. Writing a sufficiently large sample of items.
d. Obtaining a good criterion measure.
e. Cost.

a. 10. A test constructor, validating a test to be used in the selection of people to sell insurance, computed
correlations between the test and:
I. ratings by supervisors.
II. rankings on sales volume.
III. scores on a group intelligence test.
The combination of which of the above would provide the best basis for judging the validity of the
newly constructed test?
a. I and II.
b. I and III.
c. II and III.
d. All three.

a. 11. A test-construction agency is preparing a survey test of science achievement in high schools. The
following have been proposed as giving useful evidence of the validity of the test.
I. Correlating the score with the grades students had earned in high-school science courses.
II. Having experts judge whether the test covers the objectives of science instruction.
III. Giving two forms of the test and seeing whether scores are consistent from one administration
to the other.
Which would actually provide direct evidence about test validity?
a. I and II.
b. I and III.
c. II and III.
d. All of them.

d. 12. A research program is being established to develop tests to use in the selection of bank tellers. It is
probable that the test constructor will encounter the most difficulty in:
a. selecting promising tests to try out.
b. getting the cooperation of a group of subjects.
c. working out statistical procedures for determining test validities.
d. obtaining satisfactory measures of job success for each employee.
e. obtaining permission from a court to use tests that discriminate between successful and
unsuccessful employees.

d. 13. It would be most difficult to identify a criterion variable for a test measuring
a. scholastic aptitude.
b. classroom achievement in social studies.
c. vocational aptitude in engineering.
d. attitude toward methadone treatment.
e. interest in a career as an airline pilot.

d. 14. Which of the following is the best indication of the validity of a test as a measure of aptitude for
foreign language? The extent to which:
a. all children without foreign-language training achieve the same score on the test.
b. scores on the test correlate with previous language training.
c. equal amounts of training in foreign language produce equal changes in test scores.
d. scores on the test at the beginning of the school year correlate with achievement in foreign
language at the end of the year.
e. scores on the test correlate with expressed interest in foreign countries.

c. 15. Which of the following statements best describes a criterion that is free of bias?
a. A criterion on which members of different ethnic groups obtain the same score.
b. A criterion that functions independently from the culture of the person being tested.
c. A criterion on which equally capable persons obtain the same score.
d. A criterion that has been evaluated by a panel made up of members from diverse
cultural-gender-ethnic judges.

d. 16. The determination of whether a validity coefficient is large enough, needs to be based on:
a. the reliability of both the test and criterion.
b. the number of items on the test.
c. what the test is measuring.
d. what the test is being used for.
e. how many people will be tested.

a. 17. The College Board Scholastic Assessment Test is required for admission by both College P and
College Q. College P is quite selective and has room for only about 25 percent of its applicants,
while College Q admits about 75 percent. With which college is the test likely to be more useful?
a. College P.
b. College Q.
c. Same in both.
d. The selection rate is too low to be effective in either.

b. 18. The most important lesson derived from examinations of errors of estimate is:
a. the critical importance of reliability.
b. that even with the best measures, predictions are approximate.
c. their independence from standard errors of measurement.
d. their instability.
e. that precise predictions are possible only for a limited number of variables.

c. 19. Predictive validity for the Graduate Record Exam (GRE), established by correlating these scores
with grade point average (GPA) in college is quite low. This is probably because:
a. the content of the test is irrelevant to success in graduate school.
b. the GRE is unreliable.
c. the range of GPA is quite narrow.
d. the GRE is biased.
e. the wrong criterion measure is being used.

e. 20. A civil service test used to screen prospective policemen that includes law enforcement
terminology in a vocabulary test has been constructed so it will have:
a. content validity.
b. criterion validity.
c. construct validity.
d. empirical validity.
e. face validity.

d. 21. For which of the following tests would construct validity be most important?
a. A proficiency test for aviation mechanics.
b. A selection test designed to select sales persons.
c. An achievement test in high-school social studies.
d. A test designed to appraise the trait of introversion.
e. A measure of vocational interests.

e. 22. The establishment of construct validity is appropriate:
a. only for personality tests.
b. for tests for which content or criterion-related validity is inappropriate.
c. when the construct is assumed to be homogeneous.
d. when it is suspected that only one construct is being measured.
e. for all tests.

c. 23. In order to establish construct validity, it is necessary to show that:
a. the test as a whole has a high correlation with a criterion variable.
b. the items on the test are heterogeneous.
c. the test measures what its author intended for it to measure.
d. test scores are stable over time.
e. the construct has important psychological or educational meaning.
e. 24. Which of the following would be used for construct validation?
a. Correlations.
b. Predictions concerning group differences.
c. Predictions about the effects of experimental treatments.
d. Expert judgment.
e. All of the above.

a. 25. A low correlation would be considered evidence for the construct validity of a test if:
a. the correlation was with a measure of a different trait.
b. the test is a new, experimental form.
c. the criterion measure has a low reliability.
d. the sample tested is quite small.
e. None of these. A low correlation is not evidence of construct validity.

b. 26. Which type of validity would be most important for criterion-referenced tests?
a. concurrent
b. content
c. construct
d. face
e. predictive

e. 27. Combining the results of several studies of a test's predictive validity to get a better estimate of how
well the test will work in varied settings is known as
a. construct validation.
b. content representation.
c. meta validation.
d. prediction extension.
e. validity generalization.

e. 28. Monotrait, monomethod coefficients provide evidence about a test's
a. convergent validity.
b. discriminant validity.
c. construct validity.
d. validity generalization.
e. reliability.

a. 29. When the monotrait, heteromethod correlations are high, we are likely to conclude that
a. the test demonstrates convergent validity.
b. the test is reliable.
c. a mistake has been made in calculating the heteromethod coefficients.
d. the test has good discriminant validity.
e. trait definition is poor.

d. 30. To say that a test has discriminant validity is to imply that it
a. discriminates between people who have different amounts of the trait.
b. is biased.
c. must be highly reliable.
d. shows low correlations with measures of other traits.
e. shows high correlations with other measures of the same trait.

a. 31. Which statement about the test blueprint is most correct?
a. Even an attempt to construct a test blueprint can be useful.
b. If the blueprint is not complete it is not really worth doing.
c. The blueprint should only include behaviorally stated objectives.
d. The blueprint should make clear that a test is more than a sample of behavior.
e. The blueprint provides the primary evidence of face validity.

Chapter 6: Practical Issues Related to Testing

Chapter Outline

A. Factors leading to practicality in routine use.
1. Economy.
a) Actual cost.
b) Time of test administration.
c) Ease of scoring
2. Computer scoring
a) The scanning process
b) Scoring the responses
c) Scoring essays by computer
3. Computerized test interpretation
4. Features facilitating test administration.
a) Clear, full instructions.
b) Few separately timed sections.
c) Layout of items.
5. Features facilitating interpretation and use of scores.
a) Statement designating the function of the test.
b) Detailed instructions.
c) Scoring keys and special instructions for scoring.
d) Norms for appropriate reference groups.
e) Evidence of reliability.
f) Intercorrelations among subscores.
g) Correlations between the test and other variables.
h) Guides for test use and interpretation.
6. E-testing
a) Advantages
b) Risks and problems
B. Guide for evaluating a test.
1. General identifying information.
2. Information about the test.
3. Aids to interpreting test results.
4. Evidence of validity.
5. Reliability data.
6. Administration and scoring.
7. Scales and norms
C. Getting information about specific tests
1. What tests exist?
a) Tests in Print.
b) Other published and web-based sources.
c) The ETS Test Collection via the Internet.
2. Exactly what is test X like?
a) Obtaining a specimen set.
b) Examining the materials provided.
D. What do critics think of test X?
1. The Mental Measurements Yearbooks.
2. Test Critiques.
3. Other sources.
E. What research has been conducted on test X?
1. Test manuals
2. Bibliographies in MMY and TIP-4
3. Psychological Abstracts and other abstract services.
4. PsychLit, ERIC, and other databases.
5. Internet searches
F. Summary


Study Questions and Answers

1. What aspects of economy are relevant for determining practicality in routine use?
a. Economy in the cost of test materials.
b. Economy in time of testing.
c. Ease of scoring.

2. What four factors facilitate test administration?
a. Clear and full instructions.
b. Few separately timed units.
c. Timing that is not crucial.
d. An attractive well organized page layout.

3. What information should be included in the test manual to facilitate test interpretation?
a. A statement of the functions of the test.
b. Detailed test administration instructions.
c. Scoring keys and specific instructions on scoring.
d. Norms for appropriate reference groups.
e. Evidence of test reliability.
f. Information about item intercorrelations.
g. Information about the relationship between the test and other variables.
h. Guides for using the test and for interpreting results.

4. What is a final indicator of quality to look for in a test's supporting materials?
The quality of the manual(s) that accompany a test.

5. When seeking a test to fulfill a specific purpose, what is the first type of information that you would need
to obtain?
You would first need to know what tests are already in existence.

6. What are the limitations of Tests In Print and Tests: A Comprehensive Reference for Assessments in
Psychology, Education and Business?
They do not provide information on the most recent tests or about unpublished tests.

7. How do you find information on unpublished tests?
Tests in Microfiche and various directories to unpublished tests. Information may also be available in
Dissertation Abstracts, Psychological Abstracts or the Education Index.

8. How do you find out what a test is really like?
The first source you should consult is the test itself. This is the only way to really understand what a test
is like. Information of this sort can also be found in the Mental Measurement Yearbooks.

9. Where can you find information about what critics have to say about tests?
The first source that you should automatically turn to is the Mental Measurement Yearbook. Another
source of information is Test Critiques. Reviews of tests also appear sporadically in psychological and
educational journals.

10. Where can you obtain information about the research that has been conducted on a test?
One source is the manuals that accompany tests. They generally include bibliographic information on
existing studies. The Mental Measurement Yearbooks contain reference lists of studies that have been
conducted on tests as well. Other sources include Dissertation Abstracts International, the Education
Index, and the Psychological Abstracts. A useful recent resource is the Index to Tests Used in
Educational Dissertations.

11. Why are test manuals not always good sources of information about a test or references about the test?
The publisher is unlikely to include anything in such a manual that might make any potential purchaser
not want to adopt the test. As a result, the information they provide tends to be one-sided.


Important Terms

a. The actual test you plan to use.
b. Current publisher's catalogs.
c. Mental Measurement Yearbook (various editions)
d. News on Tests
e. Psychological Abstracts
f. Tests: A Comprehensive Reference for Assessment in Psychology, Education, and Business
g. Tests and Measurement in Child Development
h. Tests in Microfiche
i. Tests in Print


Match the description below with one of the terms listed above.

____1. A source of 1000 unpublished instruments intended for use with young children.

____2. A first source you would use when you want to find out about what published tests are
available.

____3. An important source of information about research on tests.

____4. A service provided by the Educational Testing Service Collection which makes a large number of
unpublished tests available to libraries.

____5. A source of information about recently published tests.
____6. Best source of information on what experts say about a test.

____7. The most important source of information about the contents of a test you are considering
adopting.

____8. A newsletter from ETS which provides announcements about new tests.

____9. A source of information about tests that provides the name, publisher, address, and a short
description of the test.


Answers to Important Terms

1. g 4. h 7. a
2. i 5. b 8. d
3. e 6. c 9. f


Multiple Choice Items

a. 1. Which of the following test features facilitates test administration?
a. Practice items.
b. Several parts timed separately.
c. Response choices that are close together.
d. Novel item types.
e. The need for precise timing of the test.

c. 2. Which of the following sources would be most likely to provide information about new tests?
a. Measurement textbooks.
b. Publishers' manuals.
c. Publishers' catalogs.
d. Compilations of available tests.
e. Reviews of tests in the MMY.

a. 3. A student wanting to find out the most recently published achievement tests in arithmetic might
best consult:
a. publisher's catalogs.
b. Buros--The Mental Measurement Yearbooks.
c. The Journal of Educational Measurement.
d. Tests in Print.
e. The Journal of Mathematics Education.

d. 4. Which two of the following, used jointly, would enable one to prepare efficiently a complete and
up-to-date list of available arithmetic tests?
I. Education Index.
II. Publisher's catalogs.
III. Review of Educational Research.
IV. Tests in Print.
a. I and II.
b. I and IV.
c. II and III.
d. II and IV.
e. III and IV.

b. 5. In which journal should one look for reviews of new tests?
a. Contemporary Psychology.
b. Journal of Educational Measurement.
c. Journal of Educational Research.
d. Psychological Abstracts.
e. Mental Measurements Yearbook.

a. 6. Which of the following sources would be most likely to provide information on unpublished tests?
a. Tests in Microfiche
b. Journal of Educational Measurement
c. Mental Measurements Yearbook
d. Test Critiques
e. Testing Abstracts.

c. 7. Which of the following sources would be most useful for finding statistical information concerning
the norms of a particular standard test?
a. Journal of Consulting Psychology.
b. The publisher's catalog.
c. The manual of the test.
d. A book on statistical methods in education.
e. A book on educational and psychological measurement.

d. 8. Assume that you are responsible for selecting tests for your school district. Which of the following
sources must be consulted before a test is selected?
a. compilations of available tests.
b. Publishers' catalogs.
c. Measurement textbooks.
d. The test manuals for the tests you are considering.
e. The testing experts at a nearby university.

c. 9. The most serious limitation to using the test manual to evaluate a particular test is that the manual
is likely to:
a. give only out-of-date information.
b. give too little data to be useful.
c. present a biased and partisan picture.
d. be too technical for most test users.
e. be difficult and expensive to obtain.

b. 10. A student who wanted to find out what validation studies had been done on the Kuder
Occupational Interest Survey in the last two or three years should go to:
a. Buros--The Mental Measurement Yearbooks.
b. Psychological Abstracts.
c. Encyclopedia of Educational Research.
d. Review of Educational Research.
e. Contemporary Psychology.

d. 11. Which of the following provides the best source for critical reviews of tests?
a. Measurement textbooks.
b. Professional journals.
c. Compilations of research abstracts.
d. The Mental Measurement Yearbooks.
e. Tests in Print.

a. 12. A student who wanted to find critical reviews of the Iowa Tests of Basic Skills might best consult:
a. Buros--The Mental Measurement Yearbooks.
b. Measurement and Evaluation in Guidance.
c. The manual of the test.
d. A book on statistical methods in education.
e. Tests in Print.

e. 13. Which of the following would a user find in Buros' Twelfth Mental Measurement Yearbook?
a. Critical reviews of tests by disinterested "experts."
b. Reviews of recent books dealing with tests and measurements.
c. Information on cost, publication date, administration time, etc., for a test.
d. A list of test publishers.
e. All of these.

d. 14. For which of the following purposes would one be most likely to find the Education Index useful?
a. To prepare a complete list of available intelligence tests.
b. To get critical evaluations of Intelligence Test A.
c. To find out the publisher and cost of Intelligence test A.
d. To locate research studies on a problem in intelligence testing.
e. All of these.

Chapter 7: Achievement Tests and Educational Decisions

Chapter Outline

A. Values and decisions.
B. No Child Left Behind.
1. Overview of NCLB.
2. Standards and assessment.
3. Accountability.
C. Placement decisions.
1. Assigning students to a within-class group.
2. Issues related to mainstreaming
a) Right to least restrictive setting
b) Individualized educational programs (IEPs)
3. How placement decisions are made.
a) Use of initial achievement measures.
b) General level of academic performance.
c) Learning disabilities and emotional disturbances.
D. Classroom instructional decisions.
1. Formative vs. summative evaluation.
2. The use of objectives.
3. Types of assessment instruments.
a) Standardized achievement tests.
b) Assessment material packaged with curriculum materials.
c) Teacher made assessment instruments.
1) Pencil and paper tests.
2) Oral tests.
3) Product evaluations.
4) Performance assessment.
5) Affective measures.
E. Day-by-day instructional decisions.
1. Using test performance as a basis for altering instruction.
2. Using tests with subject matter that has a hierarchical structure.
3. Tendency of instructional objectives to be fuzzy.
F. Reporting academic progress.
1. Performance in relation to perfection.
2. Performance in relation to par.
a) Use of percentiles or standard scores.
1) Inappropriate if the spread between the highest and lowest
performance is small.
2) Forces half of the group to be below average.
3) Standardized achievement tests interpret par in terms of a
national sample.
3. Performance in relation to potential
a) Potential is difficult to determine.
b) It is inappropriate if performance and potential are highly correlated.
4. Assigning grades
5. Importance of grades
a) Need for grades
b) Who uses grades?
G. Planning educational futures.
1. Using present measured performance to predict future achievement.
2. Using grades as an indicator of achievement.
3. Avoiding mistakes when using test performance to predict the future.
a) Premature decision making.
b) Closing out options.
H. Selection decisions.
1. Making decisions that are of utility to organizations.
2. Using past achievement or present performance as predictors of future
performance.
I. High-stakes decisions
1. High-stakes testing.
2. Minimum competency.
3. Curricular validity.
4. Instructional validity.
5. High stakes testing and NCLB
J. Curricular decisions.
1. Evaluating curricular change
2. Choosing measures
3. Data collection strategies
K. Public and political decisions.
1. What the public believes
2. Summarizing results for public consumption
L. Summary.


Study Questions and Answers

1. What is the difference between formative and summative evaluation?
Evaluation of student performance for use by teachers to improve instruction is called formative evaluation.
The more structured assessments associated with assigning grades or making placement decisions are called
summative evaluation.

2. What type of assessment techniques do most teachers favor?
Informal rather than formal assessments.

3. What is the purpose of No Child Left Behind?
To ensure that all children achieve important learning objectives while being educated in safe classrooms by
highly qualified teachers.

4. What are the major goals for NCLB?
a. Universal proficiency
b. Highly qualified teachers
c. Safe environments that are conducive to learning
d. All students will develop proficiency in English
e. All students will graduate from high school

5. What are the major principles of NCLB?
a. Local accountability for results
b. Research based instructional methods
c. Increased local flexibility in use of federal funds
d. Parental choice

6. What are some disadvantages of standardized achievement tests?
a. They can only assess the general objectives that can be found in the curriculum of most school
districts.
b. They can only assess the sort of cognitive skills that can be assessed by pencil and paper tests.
c. They require a lot of time.
d. There may be a large gap in time between when the test is administered and when it is returned to
the teacher.

7. Why should tests packaged with instructional materials be examined carefully before being used?
The tests packaged with instructional materials are often poorly prepared.

8. List the five general methods for collecting data on the achievement of instructional objectives.
a. Paper and pencil tests.
b. Oral tests.
c. Product evaluations.
d. Performance tests
e. Affective measures.

9. List two situations in which placement decisions are usually made.
a. For gifted students whose level or speed of learning greatly exceeds that of other students.
b. For students who are having a great deal of difficulty in a regular class.

10. What have educational researchers found out about the degree to which placing students in different
educational treatments can enhance learning?
The results are mixed and sometimes contradictory. Some studies support this practice and others either fail
to show any educational gains or indicate poorer performance.

11. Under what conditions are tests most likely to be useful in making placement decisions?
When they assess specific entry knowledge and skills for a particular subject matter area.
When the alternative instructional treatments differ.

12. List three ways of describing a childs performance in school.
a. Performance in relation to perfection.
b. Performance in relation to par.
c. Performance in relation to potential.

13. List the two main problems associated with reporting student performance in terms of par.
a. There may be too little variability within a class and therefore distinctions may be made where they
do not really exist.
b. Classes may be so dissimilar that comparisons between classes cannot be made.

14. What is the disadvantage to reporting the results of student achievement in terms of specific objectives
mastered?
Parents and members of the public may find such reports difficult to interpret without information about
which objectives are typically mastered at a certain age.

15. Why do we need grades?
a. They provide information and motivation to students.
b. They inform parents about the academic progress of their children.
c. They certify the level of achievement that a student has reached.

16. On what should grades be based?
On as pure and accurate a measurement of achievement as possible.

17. List two assumptions associated with the reference to perfection approach to grading that are particularly
difficult to meet.
a. That 100% performance represents a meaningful entity.
b. That all tests are of equal and appropriate difficulty.

18. Why do so many teachers continue to use the reference to perfection approach to grading?
Due to a lack of training in other methods. In many cases it is the only approach they have been exposed to:
they saw it used when they were students, and it is the one used by fellow teachers. It is also an approach
that students and their parents will accept without complaint.


Important Terms from Chapter 7:

a. Affective measures g. Performance in relation to par
b. High stakes decisions h. Performance in relation to perfection
c. Performance assessment i. Performance in relation to potential
d. Item sampling j. Product evaluations
e. Oral tests k. Reference to perfection
f. Pencil and paper tests


Match the description below with one of the terms listed above.

____1. A norm-referenced approach to reporting academic progress.

____2. Each student answers a subset of items from a larger pool.

____3. Assessing students by having them respond orally.

____4. Decisions that are of great importance.

____5. Assessing how well a student carries out a required sequence of steps.

____6. Assigning grades according to a set percentage of achievement (an A = 93-100, a B = 86-92, etc.).

____7. Assessments of personality characteristics.

____8. The kind of concrete objective tests that teachers typically use.

____9. The evaluation of a painting or science fair project.

____10. Evaluating students in terms of the gap between their aptitude and achievement.

____11. Reporting that a student obtained an 80% in history.


Answers to Important Terms

1. g 4. b 7. a 10. i
2. d 5. c 8. f 11. h
3. e 6. k 9. j


Multiple Choice Questions

e. 1. Which type of evaluation do most teachers prefer?
a. pencil and paper tests
b. standardized achievement tests
c. formal methods
d. aptitude tests
e. informal methods

d. 2. Which of the following is NOT an important limitation of standardized achievement tests?
a. They must focus only on objectives common to all school districts.
b. They can only include objectives that can be assessed by pencil and paper tests.
c. They require considerable time for administration.
d. They can only measure the ability to recall facts.
e. They are expensive.

a. 3. Which of the following is a strength of teacher-designed classroom tests?
a. The items tend to match the objectives of the class in which they are used.
b. They provide the best information for comparing classes.
c. They generally have good psychometric qualities.
d. They exemplify correct test construction methods.

a. 4. Product evaluations would most appropriately be used to assess:
a. penmanship.
b. cognitive skills.
c. reading comprehension.
d. aptitude for learning.
e. beliefs about the value of education.

e. 5. Placement decisions may need to be made for students whose instructional needs differ from those of
others. Consider the following groups of students:
I. gifted students whose level of achievement and speed of learning exceed other
students.
II. students having difficulty learning.
III. students with different learning styles.
With which group are placement decisions most likely to be made?
a. II only.
b. I and II only.
c. I and III only.
d. II and III only.
e. All of them.

c. 6. Under what conditions are placement decisions most likely to be effective?
a. When general measures of ability are used.
b. When instruction is similar in the different classes.
c. When specific entry knowledge and skills required for success in a class are assessed.
d. When the nature of the instructional differences is not known.
e. When the student's level of interest in the subject is considered.

b. 7. What type of information about student progress is normally reported to parents?
a. criterion-reference scores
b. norm-referenced scores
c. self-referenced scores
d. performance test results
e. portfolios

b. 8. Why is it NOT useful to report student performance in relation to potential?
a. It places low performing students at a disadvantage.
b. Tests of potential and achievement tend to differ only in terms of how they are used.
c. There is no accurate way to assess achievement.
d. Scores on measures of potential tend to be unrelated to actual achievement.
e. None of these; it is useful to report performance in relation to potential.

d. 9. Under what circumstances would an aptitude test most effectively select potential employees?
a. In the skilled trades.
b. When personality traits are important.
c. In office jobs.
d. In situations where the focus is on job training.
e. For selecting people to sell insurance.

c. 10. The first requirement for evaluating research on curriculum is:
a. flexibility in the specification of what should be assessed.
b. accurate assessment of each student.
c. clear statements of the objectives of the curriculum.
d. the availability of published tests.
e. accurate assessment of the beliefs of the teachers.

d. 11. With which of the following statements about grading would measurement specialists be most
likely to agree?
a. They are based on unnecessary and artificial judgments about student performance.
b. They promote egalitarianism.
c. They are elitist and undemocratic.
d. They operationalize and legitimize decisions based on merit.
e. They provide some of the most accurate assessments of student performance.

b. 12. The primary function that a report card should serve is to:
a. provide a permanent record of pupil achievement.
b. inform the parents about their child's work in school.
c. stimulate each pupil to put more effort into his or her school work.
d. establish educational goals for each pupil.
e. reward students for good performance.

d. 13. Which of the following should be expected to play the largest role in motivating the achievement
related behavior of secondary-school students?
a. End-of-year examinations.
b. Grades given at half-semester reporting periods.
c. Standardized achievement tests.
d. Daily school tasks, and the teacher's appraisal of them.
e. The knowledge that school grades are used to make college admission decisions.

c. 14. The subjects that elementary-school students perceive to be most important are those:
a. emphasized by parents.
b. on which most class time is spent.
c. in which the teacher assigns and scores written work.
d. that seem to be especially difficult.
e. in which they are most interested.

b. 15. As far as most students are concerned, course grades probably are of greatest value in:
a. providing motivation.
b. helping to direct their classroom learning.
c. helping to guide their educational and vocational plans.
d. helping to guide their personal development.
e. providing a reward for their accomplishments.

a. 16. Which of the following is likely to provide the most effective direction and guidance for a student's
learning?
a. Prompt analysis of errors, and a report of them.
b. A monthly grade in each subject.
c. Outlines for midterm and final examinations.
d. A comprehensive year-end examination.
e. Opportunities for parent-teacher conferences.

b. 17. Which of the following types of evidence of course achievement will usually permit the most
reliable appraisal?
a. Participation in group activities.
b. Written examinations.
c. Individual papers and reports.
d. Laboratory and workshop activities.
e. Self evaluations by the students.
a. 18. The factor that most seriously limits the value of grades for certification is that grades:
a. are not comparable between classes and schools.
b. cannot be assigned for some attributes.
c. emphasize only academic factors.
d. are assigned as letters in some schools and as numbers or percentages in other schools.
e. cannot reflect student performance with sufficient reliability.

a. 19. When a single letter or numerical grade is reported for a school or college course, the grade is most
useful if it represents:
a. a pure measure of the degree to which the student has achieved the course objectives.
b. an evaluation of achievement in the light of the pupil's ability.
c. an evaluation of achievement modified by considerations of effort and interest.
d. a total appraisal of the student's achievement and personality.
e. an evaluation that represents only formally graded written work.

b. 20. The meaning of a grade is best interpreted:
a. according to absolute standards set by the teacher.
b. in the context of the class in which the grade is assigned.
c. on the basis of effort.
d. on the basis of the percentage of material mastered.

e. 21. If the average SAT score of the entering freshmen in a college went up 10 points each year for a
five-year period, one would expect the grade point average to:
a. rise a corresponding amount each year.
b. show no change for a year or two, and then start to rise.
c. rise for a year or two, and then level off.
d. rise for a year or two, then return to the previous level.
e. show essentially no change.

b. 22. In deciding on the weight to be given to a particular item of information (e.g., exam, paper, etc.) in
determining a course grade, the primary consideration should be:
a. statistical validity.
b. content validity.
c. reliability
d. objectivity.
e. the amount of work involved.

b. 23. Test A has a mean of 60 and a standard deviation of 12 points and test B has a mean of 40 and a
standard deviation of 6 points. If each is to have the same effect in determining final class
standing, the instructor must:
a. add 20 points to scores on test B.
b. multiply scores on test B by 2.
c. divide scores on test B by 2.
d. multiply scores on test B by 2 and subtract 20 points.

c. 24. The No Child Left Behind Act (NCLB) of 2001 requires states to do all of the following
EXCEPT
a. develop standards for academic achievement.
b. develop criterion-referenced assessments to measure student achievement.
c. hold students accountable for their achievement through high-stakes uses of test scores.
d. hold schools and districts accountable for the achievement of all students.
Chapter 8: Assessing Special Populations: Psychometric, Legal, and Ethical Issues

Chapter Outline

A. Introduction
B. Summary of major legislation and litigation
1. Influential legislation
a) Education for All Handicapped Children Act
b) Individuals with Disabilities Education Act and its amendments
c) Americans with Disabilities Act
d) Family Educational Rights and Privacy Act
e) Elementary and Secondary Education Act
f) No Child Left Behind
2. Influential litigation
C. Assessment processes for special education
1. Referral to program implementation and evaluation
a) Identification and referral
b) Determination of eligibility
c) Program planning, implementation, and evaluation
D. Major domains of involvement in special education assessment
1. Intelligence and cognitive functioning
2. Adaptive behavior and self-help skills
a) Theory of general competence
b) Emotional and social competence
3. Behavioral and social-emotional functioning
4. Neuropsychological functioning
E. Assessment of English learners
1. Introduction
2. Assessment of language proficiency
a) Receptive vs. expressive communication skills
b) Factors impacting rate of language acquisition
c) Approaches for assessment
3. Assessment of academic skills for English learners
a) Lack of proficiency as threat to validity
b) Use of translations and interpreters
4. Special education assessment for English learners
a) Limited English proficiency is not a disability
F. Traditional academic functioning
1. Reading, math, and written-language assessment
2. Curriculum-based assessment
3. Ecological assessment
G. Professional standards and ethics
1. Introduction
H. Professional training and competence
1. Professional training
a) Formal training
b) Importance of staying current
2. Professional competence
a) Competencies for the proper use of tests
I. Professional and scientific responsibility
1. Standards for educational and psychological testing
a) Test construction, evaluation, and documentation
b) Fairness and testing
c) Testing applications
J. Respect for the rights and dignity of others
1. Privacy and confidentiality
a) Who will benefit?
b) How will the information be used?
K. Social responsibility
1. Distributive justice
2. Social benefits of testing
3. Maximizing the positive
a) Examine and be clear about all values involved
b) Recognize that test scores are only indicators
c) Relate test results to other information known about examinees
d) Recognize the possibility for error in all measurement
e) Acknowledge the limits of human wisdom
f) Maintain tentativeness about the basis for decisions
L. Summary


Study Questions and Answers

1. What was the major effect of each of the following pieces of legislation?
a. The Americans with Disabilities Act.
The law extended the provisions of the Civil Rights Act to people with disabilities.
b. The Education for All Handicapped Children's Act and IDEA.
These laws assure all children of a free, appropriate public education in the least restrictive
environment possible and that assessment processes be nondiscriminatory.
c. The Family Educational Rights and Privacy Act.
This law guarantees parents access to their children's records, the right to challenge information
in these records, and assurance that records will not be released to unauthorized people.
d. Elementary and Secondary Education Act and NCLB
Affect educational programs supported by federal funds such as those serving low- income
students and English learners.

2. What was the effect of each of the following court decisions?
a. Brown v. Board of Education (1954).
All children have an equal right to a public education, regardless of race.
b. Guadalupe Organization v. Tempe Elementary School District (1972).
Special provisions for accurate assessment must be made when testing children whose primary
language is not English.
c. Hobson v. Hansen (1967).
Called into question the practice of basing special placement solely on group tests; also the first case
to raise the issue of disparate impact by racial group.
d. Larry P. v. Riles (1984) and PASE v. Hannon (1980).
In Larry P., a judge ruled that standardized intelligence tests were biased against black children.
The judge in PASE came to the opposite conclusion.

3. What is the normal referral sequence for special education services?
First, a brief screening assessment is made of all children. Those children who are judged to be at risk for
learning difficulties (usually about 10%) are given in-class support by a pre-referral team.
Children for whom this service is deemed insufficient are referred, with parental permission, for a
thorough evaluation for eligibility for special education services by an assessment team.

4. What are the major domains of involvement for the provision of special services?
a. Intelligence and cognitive functioning
b. Adaptive behavior and self-help skills
c. Personal adjustment and social-emotional functioning
d. Neuropsychological functioning

5. What are the purposes that minimum competency testing is intended to serve?
a. To identify students in need of remedial instruction.
b. To assure that promoted students exceed a minimum level in necessary skill areas
c. To provide motivation to achieve at least the minimum level of competence.

6. When are "emotionally disturbed" children eligible for special services?
When their behavioral disturbances are symptoms of deeper emotional problems.

7. Why is there increasing concern regarding assessment of English learners?
Changing demographics mean that large proportions of children served by U.S. public schools speak
a language other than English as their preferred language.

8. What is BICS? What is CALP?
BICS stands for Basic Interpersonal Communication Skills and refers to the oral language skills
of speaking and listening acquired without formal instruction.
CALP stands for Cognitive Academic Language Proficiency and refers to reading and writing
skills acquired through formal instruction as well as more formal and precise applications of
language used in academic contexts.

9. What are the challenges involved with assessing the academic skills of English learners?
a. Development of reliable and valid measures for all students with non-English backgrounds is very
difficult.
b. Translations are prohibitively expensive and technically difficult.
c. Direct translations may generate non-equivalent tasks.
d. Interpreters introduce inaccuracies.

10. What are the academic areas in which children most often experience difficulties?
Reading is most common, followed by mathematics and written language.

11. How is curriculum-based assessment different from conventional assessment techniques?
Curriculum-based assessment focuses on the specific skills that are the objectives of instruction. The
frame of reference is the curriculum and the child's level of mastery of it.

12. What is ecological assessment?
Emphasis is shifted away from the learner and his or her difficulties and on to the characteristics of
the learning environment. Environmental factors might include teacher expectations, the physical
environment in the classroom, and the types of tasks students are asked to engage in.

13. How can tests help develop talent in all segments of the society?
Tests characterize each person as an individual, not as a member of any group. Tests are supportive
of a meritocracy, a system in which persons are rewarded on the basis of their abilities.

14. List the ways that you can maximize the positive aspects of testing.
a. Examine and be clear about all values involved.
b. Recognize that test scores are only indicators or signs.
c. Recognize test results as only one type of descriptive information.
d. Relate test results to whatever else is known about the person or group.
e. Recognize the possibilities of error in all types of descriptive information.
f. Acknowledge the limits of human wisdom, and maintain tentativeness about the basis for
decisions.


Important Terms

a. PL-94-142
b. Americans with Disabilities Act
c. Brown v. Board of Education
d. Emotionally disturbed
e. Larry P. v. Riles
f. Guadalupe Organization v. Tempe Elementary School District
g. Learning disability
h. Family Educational Rights and Privacy Act
i. Adaptive behavior
j. Test bias
k. Ecological Assessment
l. Distributive justice
m. Basic Interpersonal Communication Skills (BICS)
n. Curriculum-based assessment
o. Confidentiality


Match the description below with one of the terms listed above.

____1. Guarantees parents the right to see their children's school records.

____2. Law that guarantees placement of children eligible for special education in the least restrictive
environment.

____3. The Education for All Handicapped Children Act--The law that specifies a student's entitlement
to special education services.

____4. Children with average or above average ability who are not able to perform academically at grade
level.

____5. A child who can't function in a regular academic environment, not because of a lack of ability,
but rather as the result of poor mental health.

____6. The ability to function in non-academic settings.

____7. Beliefs about fairness.

____8. Language acquired by members of a group without need for formal instruction.

____9. Extends civil rights legislation to people with disabilities.

____10. Decision relating to assessment of English learners.

____11. Major court case about equal access to educational opportunities.

____12. Evaluation of the entire instructional environment.

____13. Assessment of a child's day-to-day skill development.

____14. Degree of access to information about a person.

____15. Invalidity of a test for a particular cultural or ethnic group.

____16. Court decision that found intelligence tests to be biased against minorities.


Answers to Important Terms

1. h 5. d 9. b 13. n
2. a 6. i 10. f 14. o
3. a 7. l 11. c 15. j
4. g 8. m 12. k 16. e


Multiple Choice Questions

a. 1. What law guarantees disabled children a right to a free public education in the least restrictive
environment possible?
a. The All Handicapped Children's Act.
b. The Americans with Disabilities Act.
c. FERPA.
d. The Civil Rights for the Handicapped Act.

c. 2. What law guarantees the parents of children the right to inspect their child's educational
records?
a. The All Handicapped Children's Act.
b. The Americans with Disabilities Act.
c. FERPA.
d. The Civil Rights for the Handicapped Act.

c. 3. A parent who wished to challenge the accuracy of a test score in his child's school records
would most likely refer to which law?
a. The All Handicapped Children's Act.
b. The Americans with Disabilities Act.
c. FERPA.
d. The Civil Rights for the Handicapped Act.
e. 4. Which of the following is NOT a law relating to the rights of disabled students?
a. The All Handicapped Children's Act.
b. The Americans with Disabilities Act.
c. FERPA.
d. IDEA
e. The Civil Rights for the Handicapped Act.

d. 5. What is the main reason why children continue to be labeled with terms such as "mentally
handicapped"?
a. This terminology plays a useful educational role.
b. These disabilities can be treated only when the student is willing to admit that a problem
exists.
c. These problems need to be dealt with honestly; it does no good to pretend they don't exist.
d. Only students who have been legally designated as handicapped can receive federally and
state funded services.
e. The terminology is essential to describe the type of service most appropriate for a given child.

a. 6. Which of the following court decisions is the basis for the claim that educational services
cannot be denied to children with disabilities?
a. Brown v. Board of Education
b. Hobson v. Hansen
c. Larry P. v. Riles
d. PASE v. Hannon
e. Guadalupe Organization v. Tempe Elementary School District

d. 7. If you believed that cognitive ability tests were not biased against minorities, which court
decision should you cite to support your case?
a. Brown v. Board of Education
b. Hobson v. Hansen
c. Larry P. v. Riles
d. PASE v. Hannon
e. Guadalupe Organization v. Tempe Elementary School District

d. 8. Five major court decisions are listed below. In which two cases were essentially opposite
decisions reached?
I. Brown v. Board of Education
II. Hobson v. Hansen
III. Larry P. v. Riles
IV. PASE v. Hannon
V. Guadalupe Organization v. Tempe Elementary School District
a. I and IV
b. I and V
c. II and IV
d. III and IV
e. IV and V

b. 9. What characteristic is most typical of students with a learning disability?
a. Overall poor academic performance.
b. Difficulty in learning to read.
c. A dominant right brain.
d. A deprived family background.
e. Being a female.

b. 10. At the present time, what type of problem is most closely identified with learning
disabilities?
a. Social.
b. Learning behavior.
c. Neurological.
d. Intellectual.
e. Sensory deficit.

d. 11. During which of the following stages of the referral sequence would an individualized
educational program ordinarily be formulated?
a. Screening.
b. Prereferral intervention.
c. Formal referral.
d. Eligibility determination.
e. Follow-up evaluation.

b. 12. During which stage of the referral process would a child ordinarily first encounter special
support services?
a. Screening.
b. Prereferral intervention.
c. Formal referral.
d. Eligibility determination.
e. Follow-up evaluation.

d. 13. Under the provisions of IDEA, how often must a child in special education receive a complete
psychoeducational reevaluation?
a. Every year.
b. Whenever a new IEP is formulated.
c. Whenever the child's parent or guardian requests it.
d. Every three years.
e. Whenever the child enters a new school.

b. 14. When, in reaching a diagnosis of mental retardation, an assessment focuses on abilities other than
those used in school, the assessment is most likely addressing
a. cognitive functioning.
b. adaptive behavior.
c. social-emotional functioning.
d. achievement deficit.
e. neuropsychological functioning.


e. 15. An assessor who includes information about the home environment as well as school is most
likely adopting which assessment model?
a. The universal model.
b. The curriculum-based model.
c. The learning community model.
d. The environmental model.
e. The ecological model.

c. 16. Who has the ultimate responsibility for ensuring that a test is valid for the purposes for which
it is intended?
a. The test publisher.
b. The test author.
c. The test user.
d. The professional organization to which the user belongs.
e. State boards.

c. 17. Where does the main impetus for outcomes-based assessment come from?
a. Teachers.
b. School administrators.
c. Politicians.
d. Students.
e. Parents.

a. 18. Federal guidelines indicate all of the following can be considered a disability except
a. classroom behavior problems.
b. language problems
c. orthopedic impairments.
d. problems with vision.
e. none of these; all can be considered disabilities.

b. 19. Which of the following is true for assessing a child whose primary language is not English?
a. Any test given must not stress spoken language.
b. The test results must be explained to the parents in their primary language.
c. Any testing conducted by an English speaking evaluator will be invalid for this child.
d. Any tests used must provide norms for children whose primary language is not English.
e. A performance test score must be weighted more heavily than a verbal test score.

d. 20. The set of rules that guides the conduct of people who provide psychological services is
known as
a. practice guidelines.
b. licensure rules.
c. the code of fair testing.
d. ethics.
e. professional training standards.

b. 21. Four organizations that are involved in tests and testing practices are listed below. Which of
these organizations have published codes of ethics for professional practice?
I. American Counseling Association.
II. American Educational Research Association.
III. American Psychological Association.
IV. National Council on Measurement in Education.
a. I and II only.
b. I and III only.
c. I, II, and III only.
d. III and IV only.
e. all of them.

d. 22. Four organizations that are involved in tests and testing practices are listed below. Which of
these organizations published the Standards for Educational and Psychological Testing?
I. American Counseling Association.
II. American Educational Research Association.
III. American Psychological Association.
IV. National Council on Measurement in Education.
a. I and II only.
b. I and III only.
c. I, II, and III only.
d. II, III, and IV only.
e. all of them.

c. 23. Which of the following is NOT included in the Standards for Educational and Psychological
Testing?
a. Technical standards for test construction.
b. Professional standards for test use.
c. Ethical standards for best practice.
d. Application standards.
e. Administrative standards.

c. 24. What does federal law require regarding bias?
a. Tests should contain no items that have even the appearance of being biased.
b. The test taker must always bear the burden of proof regarding bias.
c. Tests must be valid predictors for everyone with whom they are used.
d. Items that are more difficult for minorities must be eliminated.
e. Test takers must be warned about the possibility of bias when taking tests for employee
selection.

b. 25. What has been the response of test publishers to accusations that their tests are biased?
a. They have ignored them.
b. They have taken every possible step to avoid even the appearance of bias.
c. They have agreed to a moratorium on testing.
d. They recommend that their tests not be used to assess minorities.
e. They have agreed to make copies of the tests available to examinees after testing.

d. 26. Which type of tests cause the most concern about the invasion of privacy?
a. Intelligence tests.
b. Multiple aptitude tests.
c. Interest inventories.
d. Personality tests.
e. Attitude questionnaires.

c. 27. Under what circumstance would restrictions on testing be most indicated?
a. For a placement test.
b. When some general social good is anticipated.
c. When the information is to be used solely for the benefit of a group other than the one of
which the examinee is a member.
d. For a test used to place students in different sections of a class.
e. For a test to be used in counseling.

d. 28. A test score is best considered
a. an effective means of labeling.
b. an indication of an individual's level of functioning.
c. an assessment that is independent of other data.
d. one descriptive fact about an individual.
e. an invasion of privacy if based on highly personal questions.

b. 29. Which of the following principles should be followed to maximize the positive results of
testing?
I. The values that individuals hold should be acknowledged.
II. The exact and quantitative nature of test scores should be stressed.
III. Test results should be integrated into a matrix of information.
IV. Test results define psychological traits.
a. I and II only.
b. I and III only.
c. I, II and IV only.
d. II, III, and IV only.
e. I, II, III, and IV.

d. 30. Which of the following can most appropriately be taken as evidence that a test is biased?
a. All members of a certain group receive low scores.
b. Norms for ethnic groups are not provided.
c. Average scores for one group differ greatly from the mean.
d. Decision accuracy differs from one group to another.
e. All of these are indicators of test bias.



Chapter 9: Principles of Test Development

Chapter Outline

A. Introduction
B. Suggestions for writing objective items.
1. General principles for objective items
a) Keep the reading level simple
b) Be sure there is one correct answer
c) Be sure the content is important
d) Keep the items independent
e) Avoid trick questions
f) Be sure the item poses a clear problem
2. Writing true-false items.
a) Suggestions for writing better true-false items.
b) Variations in the writing of true-false items.
3. Writing multiple-choice items.
a) Characteristics of the multiple-choice item.
b) Suggestions for writing better multiple-choice items.
c) Variations in the format of multiple-choice items.
d) Complex multiple-choice items.
4. Writing matching items.
a) Characteristics of matching items
b) Suggestions for writing better matching items.
c) Variations in the matching-item format.
C. Preparing the objective test for use.
1. Preparing the items
2. Test layout
3. Writing test directions
D. Scoring the objective test.
1. Correction for guessing.
a) A formula to correct for guessing.
b) Criticism of the correction-for-guessing formula.
1) Undercorrection.
2) Technical factors.
c) Why the correction-for-guessing formula is used.
E. Using item analysis to improve objective tests.
1. Simplified procedures for conducting item analyses.
a) Item difficulty
b) Discrimination
c) Functioning distractors
2. Formal item-analyses procedures.
a) Discrimination index.
b) Item-total score correlations.
c) Using discrimination indexes.
F. Writing essay items
1. Advantages and difficulties of essay items.
2. Variations on the essay format.
3. Writing essay questions
4. Preparing the essay test.
5. Scoring essay tests.
G. Summary.


Study Questions and Answers

1. An examination of the test blueprint on pages 150 and 151 of the textbook suggests four reasons why
it is best to consider test items to be samples of behavior. What are they?
a. Only those objectives suitable for appraisal with a paper-and-pencil test are included in the
blueprint.
b. The entries in the cells under each area of content are examples that illustrate, but do not
exhaust the total content.
c. There is an unlimited number of items that could be written for the material that is included in
the blueprint.
d. The time available for testing is limited, and you can include only a small sample from the
domain of all possible items.

2. What four principles should guide the construction of a test?
a. The amount of emphasis that should be placed on each of the content areas and objectives. In
other words, the proportion of all the items on the test that should be written for each content
area and for each objective within each content area.
b. The types of items that should be included on the test.
c. Test length. How many questions or items should the total test contain? How many items
should be written for each cell of the blueprint?
d. How difficult the items should be.

3. What factors influence the number of items to be included on a test?
a. The type of item used on the test.
b. The age and educational level of the student.
c. The ability level of students.
d. The length and complexity of the items.
e. The type of process objective being tested.
f. The amount of computation or quantitative thinking required by the item.

4. What is the appropriate level of difficulty for a test?
In general the difficulty of a test should be set at the point halfway between the number of items a
student could be expected to get correct by guessing and the total number of items on the test.
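
As an illustrative sketch only (not from the text; the function name and figures are hypothetical), the rule stated above can be expressed directly in Python:

def target_mean_score(n_items, n_options):
    # Expected number correct by blind guessing
    chance_score = n_items / n_options
    # Halfway between the chance score and a perfect score
    return (chance_score + n_items) / 2

# Example: an 80-item test with four options per item
print(target_mean_score(80, 4))  # 50.0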

5. List six suggestions for writing good objective items.
a. Keep the reading difficulty and vocabulary level as simple as possible.
b. Be sure each item has a correct or best answer.
c. Be sure each item deals with an important aspect of the content area.
d. Be sure each item is independent.
e. Avoid the use of trick items.
f. Be sure the problem posed is clear and unambiguous.

6. Why are true-false items not considered to be a good method of evaluating student knowledge?
a. It is difficult to control the difficulty of true-false items because they tend either to be too
easy or to assess trivial content.
b. Guessing plays too large a role in determining a student's score.

7. List the suggestions for writing better true-false items.
a. Ensure that the item is unequivocally true or false.
b. Avoid the use of specific determiners or qualifiers.
c. Avoid ambiguous and indefinite terms of degree or amount.
d. Avoid the use of negative statements and, particularly, double negatives.
e. Limit true-false statements to a single idea.
f. Make true and false statements approximately equal in length.
g. Include the same number of true as false statements.
h. Write statements in your own words.
i. If the statement is based on opinion, the source for the opinion should be included.

8. What characteristic of multiple-choice items makes them most useful?
The difficulty of these items can be adjusted by manipulating the plausibility of the distractors.

9. List suggestions for writing better multiple-choice items.
a. Be sure the stem of the item clearly formulates a problem.
b. Include as much of the item as possible in the stem and keep the options as short as possible.
c. Include in the stem only the material needed to make the problem clear and specific.
d. Use the negative only sparingly in the stem of an item.
e. Use novel material in formulating problems to measure understanding or ability to apply
principles.
f. Be sure that there is one and only one correct or clearly best answer.
g. Be sure wrong answer choices are plausible.
h. Be sure no unintentional clues to the correct answer are given.
i. Use the option "none of these" or "none of the above" only when the keyed answer can be
classified unequivocally as right or wrong.
j. Avoid the use of "all of these" or "all of the above" in the multiple-choice item.

10. What other item type does the matching item most closely resemble?
Matching items resemble a series of multiple-choice items.

11. For what type of material are matching items most effectively used?
When the instructor wishes to assess students' knowledge of a series of associations.

12. List the suggestions for writing better matching items.
a. Keep the statements homogeneous.
b. Limit the number of items in each exercise.
c. Have the students choose answers from the column with the shortest statements.
d. Use a heading for the columns.
e. Have more answer choices than entries to be matched.
f. Arrange the answers in a logical order.
g. Specify whether answers can be used more than once.

13. List suggestions for preparing objective items for use.
a. Arrange items on the test so that they are easy to read.
b. Plan the layout of the test so that a separate answer sheet can be used to record answers.
c. Group items of the same format (true-false, multiple-choice, or matching) together.
d. Within item type, group items dealing with the same content together.
e. Arrange items so that difficulty progresses from easy to difficult.
f. Write a set of specific directions for each item type.
g. Be sure that one item does not provide clues to the answer of another item or items.
14. What major assumptions underlie the correction-for-guessing formula?
It is based on the assumption that all incorrect answers are the result of guessing and that all guesses
are completely "blind."
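
A minimal Python sketch of the conventional formula that follows from these assumptions (it is the rights-minus-a-fraction-of-wrongs rule described under "Correction for guessing" in the Important Terms below; the function name is illustrative, and omitted items are simply not counted):

def corrected_score(rights, wrongs, n_options):
    # Corrected score = rights - wrongs / (number of options - 1);
    # omitted items are ignored rather than counted as wrong.
    return rights - wrongs / (n_options - 1)

# Example: 76 right and 12 wrong on a test of four-alternative items
print(corrected_score(76, 12, 4))  # 72.0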

15. Under what conditions does the correction-for-guessing formula undercorrect and under what
conditions does it overcorrect?
It undercorrects when guesses are not truly blind, that is, when the student can eliminate some
distractors as being implausible and then guesses among the fewer remaining options. It overcorrects on items that
require a higher level of thought processing and wrong answers are the result of faulty cognitive
strategies rather than guessing.

16. Given its limitations, why is this formula used?
To motivate students to refrain from guessing.

17. What major assumption underlies item analysis procedures?
The assumption that the overall test is measuring what it is supposed to measure.

18. How is the discrimination index computed?
The discrimination index D is computed by subtracting the number of students who got an item
correct in the lower group (CL), from the number who got the item correct in the upper group (CU)
and dividing it by the number in each group (N).
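
As an illustration (the function name and numbers are hypothetical), the computation described above can be sketched in Python:

def discrimination_index(correct_upper, correct_lower, group_size):
    # D = (CU - CL) / N, where N is the number of students in each group
    return (correct_upper - correct_lower) / group_size

# Example: 45 of 50 upper-group students and 20 of 50 lower-group
# students answer the item correctly
print(discrimination_index(45, 20, 50))  # 0.5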

19. When is a negative D most likely to occur?
With complex material for which the correct response can be selected without any real understanding
of the underlying concept being measured. This makes it possible for the poorer student to get the
item correct by guessing, while the better student may struggle through the convoluted logic required
to solve the problem and ultimately end up with the wrong answer.

20. How is an item analysis conducted using item-total score correlations?
It is generally performed using a computer. Each item is correlated with the total score and the
resulting correlation can be interpreted in the same way as the discrimination index D.
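
A minimal sketch of such a computation, correlating 0/1 item scores with total scores (the data and function name are hypothetical):

def item_total_correlation(item_scores, total_scores):
    # Pearson correlation between 0/1 item scores and total test scores
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_scores, total_scores)) / n
    sd_i = (sum((i - mean_i) ** 2 for i in item_scores) / n) ** 0.5
    sd_t = (sum((t - mean_t) ** 2 for t in total_scores) / n) ** 0.5
    return cov / (sd_i * sd_t)

# Hypothetical scores for five students on one item and on the whole test
print(item_total_correlation([1, 1, 0, 1, 0], [42, 39, 20, 35, 25]))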

21. What factor is most likely to cause the discrimination index or item-total correlation to be
suppressed?
Items that are either too easy or too difficult. The discrimination index is more affected by this item
characteristic than are item-total correlations.

22. When would an essay test be preferred over an objective test?
Essay tests must be used to appraise ability to organize a response, originality, and argumentation.

23. What are four variations on the essay test?
a. The open-book exam
b. The take-home exam
c. The use of study questions from which a small number will be selected
d. "Cheat sheets"

24. What are the principles for writing good essay test items?
a. Know what mental processes you want to assess
b. Use novel material
c. Use words such as "compare" or "contrast"
d. Make the task clear
e. If you use controversial material, the task should be to take a position and defend it
f. Adapt length and complexity to the ability level of the students

25. List six principles for scoring responses to essay questions.
a. Decide in advance what qualities will be judged
b. Prepare model answers
c. Read all the answers to one question before going on to the next one
d. Shuffle papers between scoring questions
e. Grade papers anonymously
f. Write comments and correct errors


Important Terms

a. Clang association
b. Correction for guessing
c. Difficulty index
d. Item discrimination method
e. Item-total correlation
f. Matching items
g. Objectives-referenced tests
h. Overmutilated item
i. Selection response items
j. Supply response items
k. Test blueprint
l. Unidimensional test


Match the description below with one of the terms listed above.

____1. Method of conducting an item analysis that is based on determining the relationship between
individual items and the total score.

____2. A test made up of items that are homogeneous and measure a single trait.

____3. An item analysis technique based on an examination of each item in terms of the difference in
performance on each item between the best and the worst students.

____4. A detailed outline of content used in the preparation of tests.

____5. An item type for which the student has to provide a response in his or her own words.

____6. An item type for which the student chooses the correct answer from among choices that are
provided.

____7. The percentage of students that get an item correct.
____8. The process of dividing the number of incorrect items on a test by one less than the number of
options and subtracting this amount from the number of correct items on a test.

____9. A completion item with too many blanks.

____10. The practice of constructing tests by associating a set of items with an objective and judging
whether it has been mastered by determining whether a preset number of items has been correctly
answered.

____11. On a multiple-choice test, the repetition of a word, phrase, or sound in the question and the keyed
response.

____12. A type of test that includes a set of facts or terms and a set of definitions. The student is required to
associate the terms and facts with the definitions or descriptions.


Answers to Important Terms

1. e 5. j 9. h
2. l 6. i 10. g
3. d 7. c 11. a
4. k 8. b 12. f


Multiple Choice Questions

c. 1. In the construction of objective test items, the use of complex sentence structures and
sophisticated vocabulary is generally considered
a. appropriate, because this tests higher level cognitive skills.
b. appropriate if the examinees are of above average intelligence.
c. inappropriate, because this makes the test a measure of reading ability.
d. inappropriate, because the items are harder to write well.
e. appropriate if the conditions in both alternatives a and b are met.

e. 2. The following true-false item was written for an examination in measurement:
"The reliability of the block design subtest of the WISC-III in the standardization sample is
.66."
This true-false item would be considered poor because the statement:
a. cannot be classified as absolutely true or absolutely false.
b. does not require application of a student's knowledge.
c. is double-barreled.
d. contains irrelevant cues to the desired answer.
e. measures trivial material.

e. 3. The use of trick questions is considered appropriate under which conditions?
a. The examinees are of above average ability.
b. The material being tested is obvious and well known by the examinees.
c. The correct answer to the question is controversial.
d. It is desirable to get extra spread in the test scores.
e. None of these conditions; trick questions are never appropriate.

a. 4. The following true-false item was written for an examination in measurement:
"Multiple-choice questions are preferred over essay-type questions."
Which of the following is the most important fault with this item?
a. It cannot be classified as absolutely true or absolutely false.
b. It does not require application of a student's knowledge.
c. It is double-barreled.
d. It contains irrelevant cues to the desired answer.
e. None of these; the item is acceptable as written.

d. 5. The following true-false item was written for an examination in measurement:
"Multiple choice items are always preferred to true-false items in standardized tests."
Which of the following is the most important fault with this item?
a. It cannot be classified as absolutely true or absolutely false.
b. It does not require application of a student's knowledge.
c. It is double-barreled.
d. It contains a specific determiner.
e. None of these; the item is acceptable as written.

c. 6. The following true-false item was written for an examination in measurement:
"It is never appropriate not to use ambiguous test items."
Which of the following is the most important fault with this item?
a. It cannot be classified as absolutely true or absolutely false.
b. It is double-barreled.
c. It contains a double negative.
d. It contains irrelevant cues to the desired answer.
e. None of these; the item is acceptable as written.

c. 7. The following true-false item was written for an examination in measurement:
"Multiple choice test items are generally preferred to true-false test items because
they are easier to write and cannot measure knowledge of facts."
Which of the following is the most important fault with this item?
a. It cannot be classified as absolutely true or absolutely false.
b. It does not require application of a student's knowledge.
c. It is double-barreled.
d. It contains irrelevant cues to the desired answer.
e. None of these; the item is acceptable as written.

c. 8. Which type of item generally is LEAST desirable for an objective test?
a. Application level multiple-choice items.
b. Recall of facts level multiple-choice items.
c. True-false items.
d. Matching items.
e. Multiple-multiple choice items

a. 9. The practice of underlining the key word in true-false items is considered
a. good practice because it reduces item ambiguity.
b. good practice because it reduces the effects of guessing.
c. poor practice because it gives examinees clues to the right answer.
d. poor practice because it limits the complexity of the statements that can be used.
e. poor practice because it encourages the use of specific determiners.

c. 10. Which of the following is NOT an accepted variation on the true-false item format?
a. Underlining the important term.
b. Requiring examinees to correct false statements.
c. Linking two statements together to present a complex idea.
d. Basing several items on common stimulus material.
e. None of the above; these are all accepted variations on the true-false format.

b. 11. For a five alternative multiple-choice test, about what should the average item difficulty be?
a. .50
b. .60
c. .75
d. .83
e. .90

c. 12. On a four response multiple-choice test with 80 items, on the average, about how many items
would you want students to get correct?
a. 40
b. 45
c. 50
d. 60
e. 65

c. 13. Which of the following would be considered to be a produced-response type item?
a. True-False
b. Matching.
c. Short answer.
d. Multiple choice.
e. None of these; they are all selected-response.

e. 14. Most measurement experts consider the most desirable type of question for an objective test to
be:
a. short answer.
b. matching.
c. true-false.
d. modified true-false where the examinee must correct false statements.
e. multiple choice.

a. 15. The most serious limitation of the multiple-choice type of item is that it:
a. cannot appraise originality.
b. requires a high level of reading skill.
c. is limited to the appraisal of recall of knowledge.
d. encourages guessing.
e. is difficult to write.

b. 16. The use of "none of these" as an option in a multiple-choice item is appropriate only when:
a. the number of possible answer choices is limited to two or three.
b. the options provide absolutely correct or incorrect answers.
c. guessing is apt to be a serious problem.
d. more than five correct answers can be provided.
e. the item stem presents an ambiguous problem to the examinee.

d. 17. A carefully constructed objective test will:
a. be structured in such a way that the questions can be answered on the same page as the
question.
b. have different types of items dispersed throughout the test.
c. have the items arranged with the most difficult items first.
d. have items with the same content grouped together.
e. have items measuring each content area dispersed throughout the test.

b. 18. In a multiple-choice examination made up of five-choice items, if one wanted to correct the
results for guessing one would be most likely to score the examination:
a. rights minus 1/5 wrongs
b. rights minus 1/4 wrongs.
c. rights minus 1/3 wrongs
d. rights minus 1/2 wrongs.
e. rights minus wrongs.

b. 19. A biology teacher plans to give a final examination consisting of 120 five-option multiple-
choice questions to a group of intellectually gifted tenth graders. He wants to use the test to
obtain a rank order of students according to their level of achievement in biology. What would
be the ideal mean score for the class under these conditions?
a. about 60
b. about 70
c. about 80
d. about 90
e. about 100

d. 20. On a 100-item test, Louis marked 76 answers correctly, 12 answers incorrectly, and omitted 12
items. All questions were four-alternative multiple-choice items. If the scores were corrected
for guessing by the usual formula, Louis' corrected score would be:
a. 52
b. 64
c. 70
d. 72
e. 74
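
Items 18 and 20 both rest on the usual correction-for-guessing formula, score = rights minus wrongs divided by (k minus 1), where k is the number of answer choices. A brief illustrative sketch of the computation (not part of the original key):

    # Correction for guessing: rights minus wrongs / (number of options - 1).
    def corrected_score(rights, wrongs, n_options):
        return rights - wrongs / (n_options - 1)

    # Item 18: five-choice items, so the penalty is wrongs / 4.
    # Item 20: Louis, four-choice items; omitted items carry no penalty.
    print(corrected_score(76, 12, 4))   # 72.0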


Special Directions: The next 10 questions are based on the following item analysis data of a test in
introductory psychology. There were 150 students in the upper group and 150 students in the
lower group. The data are presented in terms of the percentage of each group choosing each
option. Correct answers are marked with an asterisk (*).

Item No.            Percentage Choosing                 r with total
                    A      B      C      D

1.  upper           07     75*    07     11               .44
    lower           19     31*    11     39

2.  upper           60*    11     17     12               .77
    lower           04*    49     35     12

3.  upper           88*    05     02     05               .60
    lower           30*    20     30     20

4.  upper           17     09     61*    11               .24
    lower           07     07     37     45

5.  upper           44     17     07     31*              .11
    lower           44     17     13     22*

6.  upper           11     00     43*    46               .00
    lower           11     19     43*    26

7.  upper           27*    73     00     00              -.09
    lower           35*    55     00     10

8.  upper           25     25     25     25*              .00
    lower           25     25     25     25*

9.  upper           10     90*    00     00               .66
    lower           00     25*    50     25
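
For readers checking the keys to items 21 through 32, the conventions assumed here are the usual ones: an item's difficulty is the average of the upper- and lower-group percentages choosing the keyed answer, and the discrimination index D is the upper-group proportion correct minus the lower-group proportion correct. A small illustrative sketch (our own, not the textbook's):

    # Difficulty and upper-lower discrimination (D) from the percentages above.
    def difficulty(upper_pct, lower_pct):
        return (upper_pct + lower_pct) / 2.0      # expressed as a percentage

    def discrimination(upper_pct, lower_pct):
        return (upper_pct - lower_pct) / 100.0    # expressed as a proportion

    print(difficulty(60, 4))        # 32.0 -- item 2 (question 21)
    print(discrimination(60, 4))    # 0.56 -- item 2 discriminates strongly
    print(discrimination(25, 25))   # 0.0  -- item 8, chance-level only (questions 27, 32)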

b. 21. What is the difficulty of item 2?
a. 30%
b. 32%
c. 60%
d. 64%
e. 77%

c. 22. In terms of difficulty, in which of the following categories would item 9 be placed?
a. Very easy.
b. Easy.
c. Moderate.
d. Hard.
e. Very hard.

b. 23. Which of the following items is most discriminating?
a. 1.
b. 2.
c. 3.
d. 4.
e. 5.

b. 24. Which of the following items most nearly meets all of the characteristics of the ideal item?
a. 2.
b. 3.
c. 4.
d. 8.
e. 9.

c. 25. Which of the following is a positively discriminating item with a negatively discriminating
mislead?
a. 2.
b. 3.
c. 4.
d. 5.
e. 6.

d. 26. Which of the following items make no desirable contribution to differentiating students in
terms of total score?
a. 5 only.
b. 6 and 8 only.
c. 7 only.
d. 6, 7, and 8 only.
e. 5, 6, 7, and 8 only.

d. 27. On which item was there only chance success?
a. 2.
b. 6.
c. 7.
d. 8.
e. 9.

c. 28. Which of the following items has a nonfunctioning mislead?
a. 4.
b. 5.
c. 7.
d. 8.
e. 9.

a. 29. Consider each of the following suggested changes for item 9.
I. Replace option A.
II. Make the item easier.
III. Make options C and D more attractive to the upper group.
Which of these changes should be made?
a. I only.
b. II only.
c. III only.
d. I and II only.
e. None of them; leave the item as it is.

e. 30. On the basis of the item analysis data, which items can be used again without revision?
a. 3 only.
b. 1 and 2 only.
c. 2, 3, and 9 only.
d. 1, 2, 3, and 9 only.
e. 1, 2, and 3 only.

e. 31. From the perspective of item analysis, an item that everyone got correct would be considered:
a. good because it indicates that students have learned the material.
b. good because it increases student morale.
c. good if it is put at the beginning of the test as an ice breaker.
d. bad because it contributes to grade inflation.
e. bad because it contributes nothing to variability.

c. 32. What would be the D for an item that every student got correct?
a. 1.00
b. .50
c. .00
d. -.50
e. -1.00

b. 33. In general, which question is likely to have a negative discrimination index? The question that:
a. the best students get right.
b. the worst students get right.
c. about half of the students get right.
d. nearly everyone gets right.
e. nearly everyone gets wrong.

b. 34. When using the correlation item analysis approach, each item is correlated with:
a. the score on a criterion measure.
b. the total score on the test.
c. the average score on test.
d. every other item on the test.
e. the item trace line.

d. 35. In scoring an essay test with more than one question, it is considered good practice to
a. read each person's paper entirely before going on to the next one
b. maintain the papers in the same order for reading all answers
c. know whose paper you are reading so you can adjust expectations accordingly
d. read all answers to question one before reading any answers to question two
e. all of the above are recommended practices


b. 36. Most measurement experts would say that the practice of allowing students to select which
essay questions they will answer is:
a. good practice because it improves motivation
b. bad practice because it makes it harder to accurately compare the quality of students'
answers
c. good practice because it encourages students to learn one area more deeply
d. bad practice because it allows students to ignore some of the subject matter the test
covers
e. good practice because it allows students to answer the questions where they have the most
knowledge.
Chapter 10: Performance and Product Evaluation

Chapter Outline

A. Introduction.
B. The artificiality of conventional cognitive tests.
1. Inappropriate use of conventional cognitive tests.
C. Assessing products.
1. When the assessment of the process is necessary.
a. Safety.
b. Transitory performance.
c. Difficult-to-evaluate products.
D. Applying performance and product evaluation to cognitive tasks.
E. Assessing processes.
1. Using checklists.
2. Constructing rating scales
F. Evaluating products and processes.
1. Using observers.
2. Advantages of multiple observers.
3. Reliability for multiple observers.
G. Systematic observations.
1. Conducting the systematic observation.
a. What should be observed?
b. What behaviors represent a personality attribute?
c. When and for how long should observations be made?
d. How should observers be trained?
e. How should the observations be organized?
2. Advantages and disadvantages of systematic observations.
a. Advantages of systematic observations.
1) A record of actual behavior.
2) Use in real life situations
3) Use with young children.
b. Disadvantages of systematic observations.
1) Cost of making the observations.
2) Problems in fitting the observer into the setting.
3) Difficulties in eliminating subjectivity and bias.
4) Difficulties in determining a meaningful and productive set of
behavior categories.
H. Summary.


Study Questions and Answers

1. Why might conventional methods of assessing reading comprehension on standardized achievement tests
be considered artificial?
Standardized achievement tests require the use of selection-type items. This is usually implemented by
having the student answer multiple-choice items based on a short paragraph. This is in contrast to
conventional definitions of reading comprehension, which emphasize the ability to read entire chapters or
books and to critically analyze what has been read and synthesize it with previously learned material.

2. Under what circumstances would the assessment of processes be more important than the assessment of
products?
a. When the process involves important safety considerations such as in a chemistry experiment.
b. When the product is transitory and is therefore of less importance than the process with which it is
achieved.
c. When the product is particularly difficult to assess.

3. What factor has caused increased interest in the use of the study of processes and products to assess
cognitive processes?
The emergence of cognitive psychology and the concomitant increased interest in the assessment of
higher level thought processing. This is usually accompanied by the belief that this function is best
measured by assessing the process and products of learning.

4. List the steps that should be included when you are constructing a checklist.
a. Designate the appropriate performance or product.
b. List the important behaviors or characteristics.
c. Include common errors.
d. Put the list into an appropriate format.

5. List three ways of constructing a rating scale to assess a process.
a. Have the observer respond to each behavior or characteristic with a number.
b. Have the observer make a check mark on a graphically represented scale.
c. Have the rater think of individuals that represent the extremes of the scale.

6. What are the advantages to using multiple observers to evaluate checklists?
a. It makes it possible to determine whether observers are in agreement and thus to assess the
reliability of the ratings.
b. A set of combined scores will usually be more reliable than individual scores.
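
The reliability gain from pooling observers can be made concrete with the Spearman-Brown formula for the reliability of ratings averaged over k equally reliable observers. The numbers below are hypothetical and the snippet is offered only as an illustration:

    # Spearman-Brown: reliability of ratings averaged over k comparable observers.
    def pooled_reliability(single_observer_r, k):
        return k * single_observer_r / (1 + (k - 1) * single_observer_r)

    print(pooled_reliability(0.50, 1))   # 0.50 with one observer
    print(pooled_reliability(0.50, 3))   # 0.75 when three observers are averaged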

7. In what settings have systematic observations been most fully developed?
For use with young children.

8. List the questions that should be asked when you conduct a systematic observation.
a. What should be observed?
b. What behaviors represent a personality attribute?
c. When and for how long should observations be made?
d. How should observers be trained?
e. How should the observations be organized?

9. Why is the use of systematic observations so effective with young children?
We cannot ask young children to provide us with self-reports, and they are not self-conscious about
revealing themselves in their behaviors.


Important Terms

a. Checklist
b. Judge
c. Observer
d. Performance
e. Product
f. Systematic observations


Match the following descriptions with the above terms.

____1. Structured observations in a natural setting.

____2. A sculpture or short story.

____3. A person used to evaluate a behavior or characteristic.

____4. A speech or interpretive dance.

____5. An individual who records behaviors.

____6. A list of characteristics to which the student is required to respond with an indication of their
presence or absence.


Answers to Important Terms

1. f 3. b 5. c
2. e 4. d 6. a


Multiple Choice Questions

c. 1. Most reading comprehension tests measure the ability to:
a. think critically.
b. evaluate long prose passages.
c. select the main idea from a passage.
d. relate new to previously learned material.
e. integrate the ideas from a passage of text.

d. 2. For which of the following classes would a pencil and paper test be most appropriate?
a. music.
b. art.
c. drama.
d. science.
e. industrial arts.

d. 3. In evaluating products such as a sculpture created in an art class or performance in a speech class,
what should be the focus of the evaluation?
a. Whether the student understands the principles of aesthetic forms.
b. The student's attitude towards the project.
c. The process leading up to the product.
d. The product itself.
e. Whether proper technique was used in producing the object.

a. 4. A good reason to have an evaluation focus on the process rather than the product would occur
when:
a. the product can be considered a transition to other goals.
b. time is of primary importance.
c. it is easier to evaluate the process.
d. safety is not a concern.
e. the cost of materials makes the product too expensive.

c. 5. The primary reason there has been an increased interest in the use of performance tests for
assessing cognitive functioning is because
a. of widespread adoption of direct instruction techniques.
b. of the influence of behavioral psychology.
c. there is interest in assessing higher level processing.
d. of the unreliability of conventional assessment techniques.
e. they are easier to evaluate.

a. 6. The technique of organizing observations by listing relevant items and having the observer indicate
their presence or absence is called:
a. a checklist.
b. an unobtrusive scale.
c. a rating scale.
d. an anecdotal record.
e. behavioral analysis.

c. 7. A device that permits an observer the opportunity to record the intensity or degree of impression
made while observing a subject or setting is called:
a. a scorecard.
b. a checklist.
c. a rating scale.
d. an anecdotal record.
e. a behavioral observation form.


d. 8. In the evaluation of a process, why are checklists generally preferred over rating scales?
a. Checklists are easier to construct.
b. Checklists require less expertise.
c. Rating scales cannot be used to assess processes.
d. Rating scales require too many complex decisions.
e. None of the above; rating scales are preferred.

e. 9. Why is it important to record data about a procedure while the examinee is performing?
a. It saves time in recording.
b. It makes the observer's task more concrete.
c. The product may be intangible.
d. Validity is the most important characteristic of data collection.
e. A delay may result in inaccuracies in data collection.

c. 10. When would the ranking of products be useful?
a. For group projects.
b. For individual assignments.
c. When each student creates a similar product.
d. When the product is not concrete.
e. When there is a large number of products to be rated.

e. 11. In addition to the designation of appropriate performances or products and the listing of the
important behaviors and characteristics, checklists also should include:
a. a rating scale.
b. well defined anchors.
c. a random order.
d. an odd number of response options.
e. commonly made errors.

d. 12. The main advantage claimed for the global approach to evaluating products is that it
a. permits the analysis of components of the product.
b. permits more precise feedback to students.
c. is most compatible with absolute standards.
d. facilitates normative comparisons.
e. makes the rater's task more objective.

a. 13. Behavioral observation procedures differ from ratings in that:
a. observational procedures avoid interpretations.
b. ratings require trained observers.
c. observations require synthesis and evaluation.
d. ratings emphasize the provision of an accurate record.
e. observations produce information in quantitative form.

c. 14. With what group are behavioral observations most often used?
a. the mentally ill.
b. students.
c. young children.
d. the elderly.
e. people who are uncomfortable making ratings.

a. 15. Which schedule of observations would be most desirable?
a. A number of relatively brief periods on different days.
b. One or two long periods of observations.
c. One extended period of observation.
d. One brief period of observation.
e. All four methods are considered equal so long as the same amount of time is spent
observing.

a. 16. When implementing a program of systematic observation of an individual child or of a classroom
group, one usually wants the observer to function:
a. only as a recorder of what is seen.
b. as an active participant in the group.
c. as a recorder and synthesizer of what is seen.
d. as a recorder and interpreter of what is seen.
e. as a recorder of and participant in what happens.

a. 17. The most difficult aspect of systematic observations is:
a. the extraction of meaning from the behaviors.
b. their inappropriateness for use with children.
c. the tendency of behaviors to occur independently.
d. the time needed for interpretation when behaviors are recorded.
e. appropriate analysis of the large mass of information that observations produce.

c. 18. One important virtue of direct observation is that it:
a. is economical and efficient.
b. digs into the inner motives of the individual.
c. can be applied in natural real-life situations.
d. yields a record of behavior that is directly meaningful.
e. assures that the interpretations of events are recorded.

c. 19. Which of the following would NOT increase the reliability of observations?
a. Specifying the behavior precisely.
b. Utilizing practice sessions for the observers.
c. Permitting time between the observation and its recording.
d. Training of observers.
e. Requiring all observers to use a common recording form.

b. 20. Which of the following is a disadvantage of systematic observation?
a. Their use is restricted to the laboratory.
b. It may be difficult to fit the observer into the setting.
c. They can't be easily used with children.
d. The observed behaviors tend to be artificial.
e. The observer remains unobtrusive.
Chapter 11: Attitude and Rating Scales

Chapter Outline

A. Introduction.
B. Learning about personality from others.
1. Letters of recommendation
a) Widely used
b) Problems with letters
C. Rating scales.
1. Problems in obtaining sound ratings.
a) Factors affecting a rater's willingness to rate conscientiously.
1) Unwillingness to put forth sufficient effort.
2) Emotional reaction to person being rated.
b) Factors affecting a rater's ability to rate accurately.
1) Lack of opportunity to observe.
2) Covertness of the trait.
3) Ambiguity in the meaning of the dimension.
4) The lack of a uniform standard of reference.
5) Specific rater biases and idiosyncrasies.
c) The outcome of factors limiting rating effectiveness.
1) The generosity error.
2) The halo effect.
3) Low reliability.
4) Questionable validity.
2. Improving the effectiveness of ratings.
a) Refinements in the rating instrument.
b) Refinements in presenting the stimulus variables.
1) Providing better definitions of traits.
2) Replacing trait names with more limited and concrete descriptive
phrases.
3) Replacing trait names with a substantial number of descriptions of
specific behaviors.
c) Refinements in response categories.
1) Percentage.
2) Graphic scales.
3) Behavioral statements.
4) Person-to-person scales.
5) Present or absent scales.
6) Frequency of occurrence or degree of resemblance.
3. Improving the accuracy of ratings.
a) Selecting raters.
b) Deciding who should choose the raters.
c) Providing education for raters.
d) Selecting qualities to be rated.
e) Pooling independent ratings from several raters.
f) Constructing scales based on rater input.
g) Focusing raters on behaviors prior to rating.
4. Rating procedures for special situations.
a) Adaptive behavior scales.
b) Ranking procedures.
c) The forced-choice format.
D. Measuring attitudes.
1. General guidelines
2. Summative ratings.
a) Number of steps.
b) Types of anchors.
c) An odd or even number of steps.
3. Single-item scales.
4. Example of an attitude rating scale.
a) Item selection
b) Item analysis
c) Reliability
5. Alternative formats
a) Thurstone and Guttman scales
b) Semantic differential
c) Implicit attitudes
E. Summary.


Study Questions and Answers

1. What does the research literature tell us about the effectiveness of letters of recommendation?
Very little; there are few studies of their adequacy or effectiveness.

2. In general, what kind of comments would you expect in a "negative" letter of recommendation?
Some positive comments, usually about characteristics that are not relevant.

3. Under what conditions is a rater likely to be unwilling to rate conscientiously?
a. Situations where the rater is unwilling to put forth sufficient effort.
b. Situations where the rater identifies with the persons being rated.

4. To whom does a rater tend to owe the greatest allegiance?
The worker or student with whom he or she has worked most closely rather than the impersonal
agency requesting the evaluation.

5. List factors affecting a rater's ability to rate accurately.
a. There may be no opportunity for the rater to observe the person being rated.
b. The trait to be rated may be covert.
c. The trait being rated may be ambiguous.
d. There may be no uniform standard of reference.
e. Each rater may have his or her own rater idiosyncrasies.

6. What are the outcomes of factors limiting rater effectiveness?
a. The generosity error.
b. The halo effect.
c. Low reliability.
d. Questionable validity.

7. How can the stimulus variables be refined to improve ratings?
a. The trait names can be better defined.
b. Trait names can be replaced by several more limited and concrete descriptive phrases.
c. Each trait name can be replaced with a substantial number of descriptions of specific behaviors.

8. How can the response categories be improved?
a. Have the rater think of the person rated in terms of percentages.
b. Use graphic scales.
c. Use behavioral statements.
d. Use person-to-person scales.
e. Use present-absent forms of response.

9. Who is the ideal person to do the rating?
The person who has the most contact with the person being rated.

10. How should the decision about who should do the ratings be made?
It is not a good idea to leave these choices to the person being rated. One option is to allow the
person rated to provide a list of possible raters. They might specifically be asked to suggest
individuals who could provide ratings in specific areas.

11. What qualities should be rated?
Raters should only be used to measure qualities that cannot be assessed in any better way.

12. What two things can be done to further enhance ratings?
a. Alert the raters ahead of time about the characterizations on which they should focus.
b. Pool ratings from several raters.

13. Under what circumstances are adaptive behavior scales most often used?
As a supplement to individual intelligence tests in the making of special education decisions.

14. What are the disadvantages of adaptive behavior scales?
a. The norms for existing scales are inadequate.
b. The informant may not have adequate contact with the child being evaluated.
c. Some of the questions on these scales are intrusive and as a result there may be an unwillingness
to give honest answers. This is particularly likely to happen when a parent is an informant.
d. These scales cannot be used as a substitute for standardized measures of intelligence.

15. What is the biggest advantage of ranking procedures?
It eliminates the tendency of raters to be overly generous.

16. What is the biggest disadvantage of the forced-choice approach?
Raters strongly resist this approach because it takes control away from them.

17. Why are people more willing to give honest answers about attitudes than personality traits?
Because personality traits tend to have obvious and clear-cut levels of social desirability while there is
no such consensus about the social desirability of attitudes.

18. Why are some items stated positively and some negatively on attitude ratings scales?
To control for acquiescence.

19. List the three decisions that need to be made when constructing an attitude rating scale.
a. The determination of the number of steps to be included.
b. The selection of anchors.
c. The decision about whether to use an odd or an even number of steps.

20. Should an odd or an even number of steps be used?
There are two points of view on this topic. The advantage of an odd number of steps is that it forces
the individual to make a decision and not just select the middle point. On the other hand there will be
individuals whose frustration with not being able to choose a middle point may be manifested by the
omission of the item.

21. What is the problem with single item scales?
There is no way to determine the validity of such a scale and there is the possibility of very low
variability with the single item if everyone responds the same way.

22. How do you conduct an item analysis on an attitude rating scale?
You would want to examine frequency distributions of the different responses and compute
correlations between items and the overall score on the scale.
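
As a concrete, purely illustrative sketch of questions 18, 19, and 22 (summative scoring with negatively worded items reverse-keyed, followed by simple item-total correlations), with made-up data and a five-step scale assumed:

    # Hypothetical summative (Likert-type) attitude scale: reverse-key the
    # negatively worded items, sum to a total, then correlate each item with
    # the total score as a rough item analysis.
    from statistics import correlation   # requires Python 3.10+

    responses = [            # one row per respondent; 1-5 ratings on four items
        [5, 4, 2, 1],
        [4, 4, 1, 2],
        [2, 3, 4, 5],
        [1, 2, 5, 4],
    ]
    negatively_worded = [False, False, True, True]
    steps = 5

    def keyed(value, reverse):
        return steps + 1 - value if reverse else value

    scored = [[keyed(v, rev) for v, rev in zip(row, negatively_worded)]
              for row in responses]
    totals = [sum(row) for row in scored]

    for i in range(len(negatively_worded)):
        item_scores = [row[i] for row in scored]
        print(i + 1, round(correlation(item_scores, totals), 2))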


Important Terms:

a. Acquiescence
b. Adaptive behavior scale
c. Alternation ranking
d. Behaviorally anchored scale
e. Covertness of trait
f. Forced choice
g. Generosity error
h. Graphic scale
i. Halo effect
j. Letters of recommendation
k. Likert scale
l. Implicit attitude
m. Present-absent scale
n. Rating scale
o. Response options
p. Stimulus variable
q. Summative rating scale


Match the description below with one of the terms listed above.

____1. Another name for a summative rating scale.

____2. A method of measuring attitudes using reaction time.

____3. Using a third person (usually a parent or teacher) to provide information about whether a child
has reached specific developmental stages.

____4. The quality to be rated on a rating scale.

____6. The type of ratings that can be given on a rating scale.

____7. The construction of a rating scale using specific behaviors.

____8. The tendency for respondents to agree with statements on an attitude rating scale.

____9. A structured method of learning about person A by asking person B.
____10. A method of assessing attitudes in which the response to items is quantified and the
performance across items is totaled to obtain a numerical indication of the strength of an
attitude.

____11. A method of obtaining unstructured information about a person who is a candidate for
something such as admission to a school, a scholarship, a job, membership in a club, etc.

____12. A rating scale that requires judges to indicate for each of a series of statements whether they
apply or do not apply to an individual.

____13. The tendency to consider average a slur and to give everyone a high rating.

____14. A method of rating in which the rater must choose from among a set of descriptors, each of
which is equally socially desirable.

____15. A type of rating scale prepared in such a way that the rater needs only make a mark at an
appropriate point on a line representing the degree to which the attribute is possessed.

____16. A method of obtaining a rating of an individual by having judges first identify the individual
who belongs at the top and the one who belongs at the bottom. Judges then select the next
highest and lowest until everyone has been rated.

____17. Tendency for judges to allow the ratings of one trait to be influenced by other traits.


Answers to Important Terms

1. k 6. o 11. j 16. c
2. l 7. d 12. m 17. i
3. b 8. a 13. g
4. e 9. n 14. f
5. p 10. q 15. h


Multiple Choice Items

d. 1. Which of the following is true of research findings about letters of recommendation?
a. They have been shown to have high reliability.
b. It has been demonstrated that strength of endorsement bears no relationship to job success.
c. It has been shown that they are not really used by employers.
d. Very little research has been done with them.
e. Efforts to manipulate letter content by selecting referees have not been successful.

c. 2. What recent development has made letters of recommendation less useful?
a. The availability of computers.
b. Decline in the average level of writing ability.
c. Laws requiring that the content of personnel files must be open to the subject of the files.
d. The general tendency in society to place more confidence in objective measures.
e. Laws regulating who may be selected to provide information.

c. 3. Approximately what is the consistency among writers of letters about the same person?
a. about 0
b. .10 - .30
c. .30 - .50
d. .50 - .70
e. .70 - .90

e. 4. Studies of letters of recommendation suggest that the weak candidate can usually be identified
by the fact that the adjectives applied to such a candidate are:
a. predominantly negative.
b. predominantly neutral.
c. a mixture of positive and negative statement.
d. about evenly distributed between positive, neutral, and negative.
e. mostly positive, but not directly relevant to competence.

a. 5. A job applicant who is described in a letter of recommendations as "sincere, conscientious, and
dependable" is probably considered by the writer to be:
a. of minimal competence.
b. a person of potential that is not yet developed.
c. about an average candidate.
d. a better than average candidate.
e. a really outstanding job applicant.

b. 6. All things considered, the most widely used technique of personality appraisal in education and
industry is probably:
a. the behavior test.
b. the rating scale.
c. systematic observation.
d. the self-report inventory.
e. the sociogram.

c. 7. Which of the following characterizes rating scales?
a. They endeavor to record behavior objectively.
b. They tend to not be influenced by the biases of the rater.
c. They involve an evaluative summary of past or present experiences.
d. Raters are usually able to provide accurate ratings.
e. They tend to produce ratings that cluster at one end of the scale or the other.

b. 8. What factor, in addition to not knowing the person being rated very well, operates to limit the
effectiveness of military, civil service, and other merit rating schemes?
a. The cost of obtaining and recording ratings.
b. Loyalty of rater to person being rated.
c. The lack of equal units on the rating scale.
d. The necessity of using the untrained raters.
e. A tendency to give authority figures what they want to hear.

e. 9. Raters may not be completely conscientious and cooperative in providing the ratings they are
required to give. Consider the following factors:
I. Hostility toward the company or agency.
II. Identification with the persons being rated.
III. Unwillingness to make the necessary effort.
Which of these frequently lowers the effectiveness of industrial or civil service ratings?
a. I only.
b. II only.
c. I and II.
d. I and III.
e. II and III.

d. 10. A municipal civil service system uses a rating system in which efficiency ratings are reported
on a scale from 1 (low) to 9 (high). A rating of 7 is required if an employee is to be eligible for
the annual pay increase. Under these circumstances one will normally find:
a. a normal distribution of ratings, with the average rating of about 7.
b. a negatively skewed distribution with a mode of 6.
c. a positively skewed distribution with a mode of 6.
d. a large piling up of ratings at 7, with few ratings below this point.
e. ratings mostly of 6 or 8, with few other points on the scale used.

c. 11. Three factors that might limit the accuracy of ratings are listed below. Which of these would
necessarily limit a rater's ability to make an accurate judgment about the person being rated?
I. The person being rated has been observed only in a classroom setting.
II. The trait to be rated concerns social behaviors in relation to others.
III. The steps on the scale consist of qualitative adjectives such as superior, good,
satisfactory.
a. I only.
b. II only.
c. III only.
d. I and III only.
e. all three.

a. 12. Which of the following traits should one expect to be most reliably measured by a rating scale?
a. neatness
b. leadership
c. school citizenship
d. social adjustment
e. intelligence

d. 13. Which of the following is NOT directly related to the validity of ratings?
a. The degree of acquaintance between the rater and the ratee.
b. The tendency of raters to rate specific qualities in relation to a generalized over-all
judgment.
c. The clarity with which the trait being rated is described.
d. The number of categories used in rating a trait.
e. Definition of anchor points for the scale.

d. 14. When supervisors are called upon to judge the importance that different positive traits have in
employees, it is found that:
a. all assign high importance to originality.
b. most jobs call for the same traits.
c. supervisors differ in their leniency in evaluating subordinates.
d. different supervisors emphasize quite different characteristics.
e. most supervisors agree on what the traits to be rated mean.

c. 15. "Halo" effect refers to the tendency:
a. of one rater to influence another.
b. to rate people higher when you know them better.
c. to let a general impression of a person influence the rating of specific characteristics.
d. to make ratings too high.
e. to impose the same standard on everyone, regardless of their actual performance.

a. 16. The generosity error is illustrated by the fact that:
a. few people are ever rated below average.
b. higher ratings are given to close acquaintances.
c. a person who is rated high on one trait is usually rated high on other traits also.
d. one leans over backward not to be too hard in rating people one doesn't like.
e. raters are encouraged to give the agency requesting the ratings what they want.

a. 17. What advantages does evaluation by ranking have over rating on a summative rating scale?
a. It reduces individual differences in leniency between judges.
b. It provides an easier task for the judges.
c. It gives results that are easier to deal with statistically.
d. It eliminates any possibility of halo effect.
e. It is more popular with people doing the rating.

b. 18. As the number of equally qualified raters increases, the average rating on a rating scale would
increase most in:
a. objectivity.
b. reliability.
c. validity.
d. variability.
e. utility.

e. 19. The chief disadvantage of replacing trait names in a rating scale with a fairly long list of
specific behaviors is that the resulting instrument:
a. tends to have lower reliability.
b. shows poorer agreement between raters.
c. is annoying to raters.
d. has decreased validity.
e. tends to become long and unwieldy.

c. 20. In alternation ranking procedures, a supervisor would start out by:
a. considering alternate forms of worker competence.
b. sorting employees into 3 or 4 broad groupings.
c. identifying the best and the worst employee.
d. defining the most important worker traits.
e. ranking employees first on one trait, then another, until all traits have been assessed.
d. 21. Techniques that have been proposed for improving the results from rating scales include which
of the following?
I. Pooling and averaging the ratings of several judges.
II. Defining the specific characteristics to be rated.
III. Training raters on the use of a rating scale.
IV. Using person-to-person rating methods.
Which of these have been shown to be effective?
a. I and II.
b. I and IV.
c. II and III.
d. I, II, and III.
e. all four.

a. 22. In using ratings to evaluate employees, replacing several broad trait names with 30 or 40
specific behaviors could result in:
I. greater uniformity of meaning from one rater to another.
II. less relationship between observations and actual behaviors.
III. more difficulty in using the ratings to remedy individual strengths and
weaknesses.
Which of the above are likely to happen?
a. I only.
b. I and II only.
c. I and III only.
d. II and III only.
e. all three.

c. 23. Which of the following is an advantage to the use of graphic rating scales?
a. They conserve space.
b. They are particularly effective with sophisticated raters.
c. They provide a more attractive page layout.
d. They allow the rater to just enter a number.
e. They allow more refined rating categories.

d. 24. The most direct effect of replacing ratings of broad general traits with a yes-no checklist
covering many specific behaviors is likely to be an increase in:
a. validity.
b. acceptability to raters.
c. halo effect.
d. between-raters reliability.
e. construct validity.

b. 25. With what type of individuals would an adaptive behavior scale most likely be used?
a. Gifted children.
b. Children with learning problems.
c. The elderly.
d. Normal children.
e. Very young children.

a. 26. What is the most important problem with adaptive behavior rating scales?
a. They require sophisticated informants.
b. They contain too few items to be evaluated.
c. They tend to be abstracted from actual behavior.
d. They lack appropriate norms.
e. They require judgments of frequency.

c. 27. What is the major strength of ranking?
a. Raters prefer this method.
b. It can be used when the rater does not know those being rated.
c. It eliminates differences in rater leniency.
d. Scores are not dependent on the size of the group.
e. It eliminates halo.

a. 28. In the forced-choice pattern for rating another person, the rater is required to:
a. indicate which one from each of a number of sets of statements is most characteristic of the
person being rated.
b. arrange the persons being judged in order from high to low on each trait.
c. pick one or two individuals as most outstanding on each trait.
d. pick one or two traits as most characteristic of each person being rated.
e. choose the stimulus person that is most similar to the person being rated.

c. 29. If we wanted to get the most valid ratings of teachers' fairness in dealing with their students, we
should probably go to the:
a. supervisors who direct the teachers' work.
b. teachers' colleagues in the school.
c. teachers' students.
d. teachers themselves.
e. parents of the teachers' students.

c. 30. Which of the following is an important problem with attitude rating scales?
a. They are strongly affected by the tendency of people to only make socially desirable
responses.
b. It is usually difficult to come up with enough items.
c. They are too easy to construct.
d. People are not willing to respond to such instruments.
e. They refer to an unobservable aspect of the person.

a. 31. Which method of assessing attitudes is most often used?
a. Summative ratings.
b. Paired comparisons.
c. Semantic differential.
d. Behavioral scales.
e. Thurstone scaling.

b. 32. When constructing a summative scale for measuring attitudes it is important to include:
a. an odd number of steps in the rating scale.
b. items representing both ends of the continuum.
c. only one item for each dimension of the attitude.
d. a Likert scale on which the subjects can mark their responses.
e. ambiguous items to allow the subjects to project their feelings onto the item content.
d. 33. The set of items that makes up an attitude scale should be ones for which the judgments made by a
pool of judges show:
a. uniform mean values and wide dispersion of judgments.
b. uniform mean values and a narrow dispersion of judgments.
c. a wide range of mean values and wide dispersion of judgments.
d. a wide range of mean values and narrow dispersion of judgments.
Chapter 12: Aptitude Tests

Chapter Outline

A. Introduction
B. Theories of cognitive abilities
1. Binet's theory
a) The birth and death of IQ
2. Spearman's g
a) Two-factor theory
3. Thurstone's Primary Mental Abilities
4. Jensen and Wechsler
a) Galton, Jensen, and chronometric g
b) Wechsler and the clinical tradition
5. Cattell/Horn
a) Fluid/crystallized
b) Additional abilities
6. Carroll's three-stratum theory
7. Sternberg's triarchic theory
a) The contextual subtheory
b) The experiential subtheory
c) The componential subtheory
8. The Das/Naglieri PASS model
9. Gardner's proposal
a) Multiple intelligences
b) Problems with Gardner's proposal
C. Individually-administered general ability tests
1. The Stanford-Binet Intelligence Scale, 4th edition.
a) Early Binet-type scales.
1) Mental age, chronological age and IQ
2) Determination of mental age
b) Subtests of the Stanford-Binet.
1) Verbal Reasoning Tests.
2) Abstract/Visual Reasoning Tests.
3) Quantitative Reasoning.
4) Short-Term Memory.
c) Organization of the Stanford-Binet.
2. The Stanford-Binet Intelligence Scale, 5th edition.
a) Subtests of the SB5
b) Verbal subtests
c) Nonverbal subtests
3. Wechsler Scales.
a. Verbal Scale.
b. Performance Scale.
4. Woodcock-Johnson Psycho-educational Battery, 3rd edition.
5. The Das-Naglieri Cognitive Assessment System.
6. Nonverbal measures of cognitive ability.
a) Raven's Progressive Matrices.
b) Test of Nonverbal Intelligence (TONI)
c) Universal Nonverbal Intelligence Test
D. Abbreviated individual tests.
E. Group-administered tests of general ability.
1. The Cognitive Abilities Test.
2. Otis-Lennon.
F. Tests of multiple abilities.
1. The Differential Aptitude Test Battery.
2. The General Aptitude Test Battery.
G. The role of general cognitive ability: The Bell Curve.
H. Summary.


Study Questions and Answers

1. For what purpose were the first general ability tests developed?
To identify children who were unable to learn as a result of subnormal intellectual functioning; these
early tests were developed to meet that need.

2. What was Charles Spearman's theoretical contribution to the development of mental ability tests?
He proposed the theory that intelligence was best understood in terms of a single global intelligence
function called 'g'.

3. What are the three major subtheories of Sternberg's theory?
Contextual
Experiential
Componential

4. What does the acronym PASS stand for?
Planning, Attention, Simultaneous Processing, Sequential Processing

5. Describe the historical development of the Stanford-Binet.
Lewis Terman published the first test, the Stanford Revision of the Binet-Simon Scale, in 1916. It was
revised in 1930 and renamed the Stanford-Binet. It was revised again in 1960 and new norms were prepared
in 1972. Comprehensive revisions were published in 1986 and 2003.

6. What major change in the way standard scores were derived on the Stanford-Binet occurred with the 1960
revision?
Ratio IQs were replaced by deviation IQs.
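
To make the contrast in question 6 concrete: the older ratio IQ is simply (mental age / chronological age) × 100, while a deviation IQ is a standard score placed on a scale with a fixed mean and standard deviation (traditionally 100 and 16 for the Stanford-Binet). A minimal illustrative sketch with hypothetical numbers:

    # Ratio IQ versus deviation IQ (mean 100, SD 16 assumed here for illustration).
    def ratio_iq(mental_age, chronological_age):
        return 100.0 * mental_age / chronological_age

    def deviation_iq(raw_score, group_mean, group_sd, scale_mean=100, scale_sd=16):
        z = (raw_score - group_mean) / group_sd
        return scale_mean + scale_sd * z

    print(ratio_iq(10, 8))             # 125.0
    print(deviation_iq(60, 50, 10))    # 116.0 -- one SD above the age-group mean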

7. What is the most important change introduced in the 1986 revision of the Stanford-Binet?
The new version of the Stanford-Binet provides subscale scores rather than the single score found on the
earlier versions of the test.

8. What is the most important change introduced in the 2003 revision of the Stanford-Binet?
The 2003 version yields scores for five factors instead of four, each based on one verbal test and one
nonverbal test.

9. What is the theoretical basis for the revised version of the Stanford-Binet?
It is based on the theory of Raymond Cattell and John Horn, which proposes two types of intellectual
functioning: fluid ability, which is free of reliance on experience, and crystallized ability, which
depends on specific experiences.

10. What is the structure of the Wechsler tests?
These tests provide three overall scales: a Performance Scale, a Verbal Scale, and a Full Scale score. In
addition, five subscales are provided for each of the Verbal and Performance scales.

11. What is the advantage of nonverbal tests of intelligence?
They measure cognitive abilities but do not require the use of language.

12. What advantages do individual intelligence tests have over group-administered mental ability tests?
No reading is required which makes them particularly appropriate for use in the assessment of young
children and those who cannot read. Furthermore, the examiner can observe the examinee and gain
insights about his or her mental functioning that would not be possible with a group administered test.

13. What is the major purpose of the Differential Aptitude Tests (DAT) and the General Aptitude Test Battery
(GATB)?
a. The DAT is used primarily for guidance purposes at the secondary school level to help students
make decisions about college majors.
b. The GATB is used by agencies associated with the Department of Labor for vocational guidance. It
is primarily used with individuals not going to college.

14. What is the importance of validity generalization?
It has been used as a justification for the use of job aptitude tests in settings where they have not been
specifically validated. This is justified by the assumption that validity in one setting can be generalized to
other similar settings.


Important names and terms

a. Alfred Binet
b. Multiple aptitude test
c. Charles Spearman
d. Crystallized abilities
e. DAT
f. Fluid abilities
g. GATB
h. Lewis Terman
i. Mental age
j. Performance Scale
k. Ratio IQ
l. Standard age score
m. Stanford-Binet
n. Validity generalization
Match the description below with one of the terms listed above.

____1. Created the first individual test for discriminating between individuals who cannot learn from those
who will not learn.

____2. Introduced the theory of "general intelligence" or g.

____3. A multiple aptitude test intended as a guidance battery for use at the secondary level.

____4. Author of the Stanford-Binet.

____5. (MA/CA) × 100

____6. An argument against the need to prove the legitimacy of multiple aptitude tests for every job and
job setting.

____7. Cognitive functioning dependent on specific experience.

____8. Non-verbal scale on the Wechsler tests.

____10. Basal age plus mental age credit for each test passed on the Stanford-Binet.

____11. A test intended to provide differential predictions of success in specific jobs.

____12. The most successful translation of the Binet-Simon test into English.

____13. An occupations-oriented multiple aptitude battery developed by the Department of Labor.

____14. Cognitive functioning independent of specific learning.

____15. Establishing the legitimacy of a multiple aptitude test by demonstrating differences between
occupational groups in test scores.


Answers to Important Terms

1. a 6. n 11. b
2. c 7. d 12. m
3. e 8. j 13. g
4. h 9. d 14. f
5. k 10. i 15. n


Multiple Choice Items

a. 1. How are aptitude and achievement tests different?
a. They differ mainly in terms of the function they serve.
b. Aptitude tests measure more specific skills than achievement tests.
c. Aptitude tests have more difficult items.
d. Aptitude tests predict a narrower range of abilities.
e. Aptitude tests measure innate abilities; achievement tests measure developed ones.
a. 2. The distinction between aptitude and achievement tests is chiefly one of:
a. purposes served.
b. type of content.
c. method of measurement.
d. breadth of content.
e. the accuracy of measurement.

c. 3. The earliest tests of intelligence were constructed in order to:
a. select children with extremely high intellectual ability for special schools.
b. predict success in new types of jobs that were being created by the industrial revolution.
c. identify children who would probably have difficulty learning in the typical classroom.
d. determine the relative contribution of heredity and environment to intellectual
development.
e. classify men for military service in World War I.

d. 4. How is cognitive ability represented on the 1986 version of the Stanford-Binet?
a. Scores are computed by dividing MA by CA and multiplying by 100.
b. Ratio IQs are used.
c. Scores are computed using a process similar to z-scores.
d. Standard age scores are used.
e. Scores are called deviation IQs.

b. 5. A sound reason for avoiding the term "intelligence test" in speaking of tests of general cognitive
functioning is that:
a. there is no adequate evidence for one general cognitive ability.
b. the term has come to imply more than it should.
c. individual differences in cognitive functioning are not purely genetic.
d. tests of cognitive functioning are not truly culture-free.
e. the nature of the construct of intelligence as measured by the Wechsler scales has changed.

e. 6. The term, "academic aptitude test," is generally preferred to the term, "general intelligence test,"
because these tests:
a. are of limited use for school-age children.
b. measure the social skills learned in school as well as cognitive ability.
c. appraise school-learned skills.
d. appraise only verbal skills.
e. function primarily as predictors of success in school.

b. 7. Tests administered to individuals in order to determine how well they may be expected to do in
some future situation are called:
a. achievement tests.
b. aptitude tests.
c. diagnostic tests.
d. projective tests.
e. ability tests.

a. 8. It would be most accurate to say that general aptitude tests measure:
a. a sample of the behavior of an individual.
b. the innate capacity of an individual.
c. the maturity level of an individual.
d. the probable future success of an individual.
e. the education level of the individual.
b. 9. Most so-called general aptitude tests should be thought of as measures of an individual's ability to:
a. adjust to the environment.
b. deal with abstract ideas and symbols.
c. do creative problem solving.
d. acquire new knowledge and skill.
e. succeed in a complex society that requires constant updating of skills.

e. 10. For what range of levels are tests provided on the Stanford-Binet Tests of Intelligence, 4th Ed.?
a. two-year-old to age 14.
b. six-year-old to average adult.
c. two-year-old to age 16.
d. five-year-old to age 16.
e. two-year-old to superior adult.

d. 11. What is the most important difference between the 1986 version of the Stanford-Binet and previous
editions?
a. The way IQ scores are computed.
b. The emphasis on performance items.
c. The use of sequential and simultaneous processing as a theoretical basis for the test.
d. The grouping of similar items permitting the use of separate scores for different
dimensions.
e. The addition of norms for children as young as two years of age.

b. 12. On the Stanford-Binet Tests of Intelligence, 4th Ed., for Vocabulary, the basal age is the level at
which the examinee passes:
a. all tasks.
b. both tasks at two successive ages.
c. three out of four tasks at two successive ages.
d. one task for three successive ages.
e. ninety percent of the tasks.

d. 13. Which of the following best characterizes the Stanford-Binet scale?
a. It is "culture free," and is fair to groups from different backgrounds.
b. Its composite score is a good measure of overall native capacity.
c. It measures four independent cognitive abilities.
d. It measures four correlated cognitive abilities.
e. It is highly verbal in content.

d. 14. Which of the following tests is appropriate for pre-school children?
a. WISC-III
b. Wechsler-Bellevue
c. WAIS-R
d. WPPSI-R
e. either the WISC-III or the WAIS-R, depending on the child's exact age and ability.

b. 15. At the present time, the most widely accepted intelligence test for clinical use with adults is the:
a. Woodcock-Johnson Psychoeducational Battery-Revised.
b. Wechsler Adult Intelligence Scale-Revised.
c. Thurstone Tests of Primary Mental Abilities.
d. Raven's Progressive Matrices.
e. CEEB Scholastic Aptitude Test.
d. 16. How should discrepancies between the Verbal and Performance scores on the Wechsler tests be
interpreted?
a. Significant differences are indicative or learning problems.
b. Even large differences probably have little diagnostic significance.
c. A greater than one standard deviation difference warrants a diagnosis of a learning
disability.
d. Differences can have different implications and should be interpreted only by individuals with
extensive clinical training.

c. 17. Which of the following is not assessed by the K-ABC?
a. Sequential processing.
b. Simultaneous processing.
c. Quantitative ability.
d. Achievement.
e. General Cognitive Ability.

c. 18. With what type of person is it most advantageous to use an individual test such as the Stanford-
Binet rather than a group aptitude test?
a. A person who does not speak English.
b. A person who is physically handicapped.
c. A young child.
d. A person who has not been in school recently.
e. A candidate for a highly selective scholarship.

b. 19. A major advantage of individual aptitude tests over group tests is that:
a. the standardization group is usually larger.
b. information other than test scores can be obtained.
c. the method of evaluation is more objective.
d. they must be administered by skilled examiners.
e. they are more efficient.

a. 20. Which of the following multiple aptitude tests is used mainly to assist students in selecting college
majors?
a. DAT
b. GATB
c. ASVAB
d. CAT
e. SAT

d. 21. The emphasis in the Differential Aptitude Test Battery is on subtests that:
a. put a premium on rapid, flexible thinking.
b. have near zero correlations with each other.
c. represent a particular theory of mental functions.
d. are meaningful to guidance counselors.
e. can be interpreted by students themselves.

a. 22. The Differential Aptitude Test Battery includes subtests for all of the following abilities EXCEPT:
a. finger dexterity.
b. mechanical reasoning.
c. language usage.
d. verbal reasoning.
e. perceptual speed.

c. 23. Speed of work is an important determiner of scores on:
a. both the DAT and GATB.
b. the DAT, but not the GATB.
c. the GATB, but not the DAT.
d. neither the DAT nor the GATB.

c. 24. A distinctive feature of the General Aptitude Test Battery (GATB), as compared with other
commercially available batteries, is:
a. its emphasis on power rather than speed.
b. the inclusion of measures of motor skills.
c. the presence of several mechanical ability subtests.
d. its restriction to pure measures of a single factor.
e. its emphasis on tests that will predict academic major.

b. 25. There are thousands of validity studies that relate scores on multiple aptitude tests to the criterion
measure of job success. The results have varied from job to job and among settings. The belief that
this can be ignored and that validity in one job and setting is relevant to others is called:
a. job validity.
b. validity generalization.
c. criterion validity.
d. general reliability.
e. generalized validation.

b. 26. General cognitive ability has been found to be most highly correlated with which element of a job?
a. Job tenure.
b. The cognitive demands of the job.
c. The amount of pre-selection to which candidates are subjected.
d. The absence of physical demands in the job.
e. The psychomotor demands of the job.
Chapter 13: Standardized Achievement Tests

Chapter Outline

A. Introduction.
B. Distinctive features of standardized achievement tests.
1. Time and professional skill required.
2. Breadth of objectives.
3. Inclusion of norms.
C. Uses of standardized achievement tests.
D. Types of standardized tests.
E. Group standardized achievement tests.
1. Available tests.
2. Grade ranges covered.
3. Changes in item characteristics with age.
F. Individually administered tests
G. Secondary school and college-level achievement tests.
H. Problems with statewide administration of achievement tests.
1. The "Lake Wobegon effect."
2. Changes in student populations.
3. Teaching the test.
4. Changing the curriculum.
5. Public policy issues.
I. Interpreting the standardized achievement test battery.
1. Limited value for instructional decisions.
2. Class level item analysis.
J. Diagnostic achievement tests.
1. Like criterion-referenced tests.
2. Large number of items required.
K. Criterion-referenced standardized achievement tests.
1. Examples of standardized criterion-referenced achievement tests.
2. Problems with criterion-referenced standardized achievement tests.
L. Summary.


Study Questions and Answers

1. How do standardized achievement tests differ from locally constructed tests?
a. The amount of time and professional skills invested in the construction of the tests.
b. The range of objectives covered by the test.
c. The use of norms.

2. What is the advantage of test batteries?
a. They provide an integrated and comprehensive coverage of all objectives.
b. All subscales are standardized on the same norm group.
c. Comparisons can be made among subscales.
3. Why is it easier to construct a useful standardized achievement test for elementary school students than for
secondary students?
Achievement tests can only measure objectives to which all students have been exposed. This usually
means basic skills in reading and math which are appropriate at the elementary school level but much less
so at the secondary level.

4. What percentage of states did Cannell find to be below average on standardized achievement tests?
No states were found to be below average.

5. What is the explanation that test publishers give for this occurrence? Why is this explanation not
satisfactory?
The usual explanation is that student achievement has increased since the tests were last normed. This is
not entirely plausible because other tests such as SAT, ACT, and NAEP do not show such gains. These
gains also seem to occur with newly standardized tests.

6. What are some other explanations for why every state and nearly every school district is above average?
a. The students taking the test are different from those in the norming sample.
b. Curricula are being altered to focus specifically on test content.
c. Some teachers may be teaching the actual items or altering test administration procedures to
enhance student performance.

7. Why is detailed, item-by-item feedback on student performance for diagnostic purposes not necessarily a
good practice?
It may encourage teachers to teach test items rather than the underlying constructs the test is to assess.

8. How should a diagnostic test differ from a conventional test?
a. The items should be easier, making the test a better discriminator at the low end of the scale.
b. Norms should be de-emphasized.
c. The test should be capable of reliably measuring specific sub-skills.

12. Can an achievement test be simultaneously norm- and criterion-referenced?
Not really. It is possible to make a norm-referenced test appear to be criterion-referenced, which is
sometimes a good marketing technique. To be criterion-referenced, however, a test must measure the
specific objectives of a school district. A nationally administered standardized test cannot do this because
there is no national curriculum. Item difficulties for criterion-referenced tests and for norm-referenced
tests should be quite different, with the items on criterion-referenced tests typically less difficult. In
addition, the number of items that would be required on an adequate nationally normed, standardized
criterion-referenced test that measured a reasonable number of objectives would be overwhelming.
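A minimal sketch (not from the text) of the item-difficulty point: classical difficulty is simply the proportion of examinees answering an item correctly, and mastery-oriented criterion-referenced items typically show higher proportions than the mid-range difficulties favored for norm-referenced ranking. The response vectors below are hypothetical:

def item_difficulty(responses):
    # Classical difficulty index: proportion of examinees answering the item correctly.
    return sum(responses) / len(responses)

# Hypothetical 0/1 scores from ten examinees on one item of each kind of test
crt_item = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]   # p = 0.90: most examinees have mastered the objective
nrt_item = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # p = 0.50: spreads examinees out for ranking

print(item_difficulty(crt_item), item_difficulty(nrt_item))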


Important Terms

a. College Level Examination Program (CLEP)
b. Complete Battery
c. Diagnostic achievement test
d. Iowa Test of Educational Development
e. IOX Basic Skills Test
f. IRT
g. "Lake Wobegon effect"
h. Locator test
i. Test battery
Match the description below with one of the terms listed above.

____1. Grouping of tests prepared by the same publisher with the same administrative procedures.

____2. A version of the CTBS that provides both norm- and criterion-referenced scores.

____3. A short test administered in order to determine the most appropriate level of a student.

____4. A test first developed to evaluate and give credit for educational experiences obtained in the armed
forces.

____5. A criterion-referenced achievement test.

____6. Tendency for all states and school districts to be above average.

____7. The scaling techniques used with the CTBS.

____8. A test intended to point out the strengths and weaknesses of individuals.

____9. A test used to give credit for educational experiences outside of formal college course credit.


Answers to Important Terms

1. i 4. d 7. f
2. b 5. e 8. c
3. h 6. g 9. a


Multiple Choice Questions

c. 1. What characteristic of standardized tests makes them different from teacher-made tests?
a. They employ different types of items.
b. They emphasize the recall of facts.
c. There is a large amount of money and resources invested in their development.
d. They are more appropriate for assigning grades.
e. They should be more focused on specific learning objectives.

a. 2. For which of the following types of decisions would a published standardized test be LEAST likely
to be useful?
a. Classroom instructional decisions.
b. Placement decisions.
c. Selection decisions.
d. Accountability decisions.
e. A standardized test could appropriately be used for any of these decisions.

e. 3. Standardized achievement tests are generally preferred to teacher-made tests for all of the purposes
listed EXCEPT
a. evaluating general educational development.
b. evaluating pupil progress over a period of years.
c. comparing student performance between schools.
d. grouping pupils for instructional purposes.
e. none of these. A standardized test would be preferred in all these cases.

a. 4. For which of the following purposes would a standardized test clearly be preferable to a teacher-made test?
a. To compare achievement in reading and arithmetic in a school system.
b. To identify Pupil A's difficulties in doing subtraction.
c. To determine how well a unit on multiplying fractions has been mastered.
d. To assign midyear arithmetic grades.
e. None of these. A teacher-made test would be preferable in all these cases.

a. 5. The advantage of a standardized test battery over separate tests measuring the same skills is that the
battery:
a. uses the same group to establish norms for all subtests.
b. includes subtests that are highly intercorrelated.
c. uses the same item types for the various subtests.
d. takes less time to administer.
e. is more likely to fit the local curriculum in the various areas tested.

c. 6. Comparisons of the relative strengths and weaknesses of pupils in different subject matter areas can
be based on the results of standardized achievement test batteries, because they:
a. cover the same sample of course content.
b. measure the same sample of learning outcomes.
c. were normed on the same representative sample of pupils.
d. are equivalent in terms of difficulty and variability of scores.
e. use normalized standard scores.

e. 7. Which of the following are you least likely to find in a published elementary school achievement
battery?
a. A reading test.
b. A language usage test.
c. An arithmetic or mathematics test.
d. A science test.
e. A study skills test.

d. 8. On a standardized achievement test, as one goes from the 2nd to the 8th grade level, the emphasis
on decoding skills:
a. and comprehension both increase.
b. and comprehension both decrease.
c. increases and comprehension decreases.
d. decreases and comprehension increases.

a. 9. Which level of the CTBS should be given when there are two that are appropriate for the same
grade?
a. Brighter students should get the higher of the two levels.
b. Brighter students should get the lower of the two levels.
c. If students took the test the previous year, the same form should be used to make possible
comparisons of growth.
d. The one that is easiest to administer should be used.
e. It makes no difference.

c. 10. A test battery such as the Iowa Tests of Educational Development is useful primarily for:
a. assessing progress in specific secondary school courses.
b. identifying students in need of remedial help.
c. appraising general level of informed literacy.
d. guiding choice of college major field.
e. assessing the efficiency of the local curriculum.

d. 11. The College Level Examination Program of the Educational Testing Service would be useful
primarily for:
a. assigning grades in college courses.
b. identifying deficiencies in preparation for college admission.
c. diagnosing weaknesses within a specific subject matter area.
d. exempting students from courses in which they were already competent.
e. assessing the quality of the college's general education program.

b. 12. What is meant by the "Lake Wobegon effect"?
a. Test scores tend to lag behind what is expected.
b. Every school district seems to be above average.
c. School superintendents tend to constantly bemoan the fact that their students do poorly on
standardized tests.
d. There is a tendency for minority students to underachieve on standardized tests.
e. Teachers tend to "teach to" standardized achievement tests.

b. 13. The following are explanations for why achievement test scores tend to be so high:
I. Curricula have been altered to match what is on the test.
II. The original norming of the test was not appropriate.
III. Student achievement has actually increased.
Which of these is most plausible?
a. I only.
b. I and II only.
c. I and III only.
d. II and III only.
e. I, II, and III.

b. 14. A major factor limiting the usefulness of the results of system-wide testing for the classroom
teacher is:
a. the reliance on national rather than local norms.
b. the time lag between testing and receipt by the teacher of scores and item statistics.
c. the irrelevance of much of the test content to local objectives.
d. the complex format in which test results are reported.
e. the need to use separate answer sheets so student responses cannot be compared to the
test questions.
c. 15. The typical diagnostic testing program is based upon:
a. a series of items ranging from very easy to very difficult.
b. open-ended items that permit a qualitative analysis of the individual's responses.
c. a number of sub-tests each focused on a different skill.
d. a very broad survey of an academic area.
e. individual testing of pupils by the teacher.

a. 16. Results from a diagnostic test battery are more interpretable when the correlations among the
subtests are:
a. low.
b. moderately high.
c. quite high.
d. varied in size.
e. none of these; the correlations among the subtests do not matter.

d. 17. A component of a diagnostic test battery is likely to be useful to the extent that it:
a. deals with something that has recently been covered in class.
b. permits comparison of a child with others of his or her own age.
c. shows a high correlation with over-all performance.
d. corresponds to some specific teachable skill.
e. reveals a particular learning disability.

b. 18. How should the difficulty of the items in a group diagnostic test compare with those in a survey
achievement test, if it is to serve its diagnostic function most efficiently?
a. Items should be harder.
b. Items should be easier.
c. Items should be more varied in difficulty.
d. All items should be about the same difficulty.
e. There is no systematic relationship because it depends on the content area.

e. 19. Which of the following represent limitations on the widespread usefulness of group diagnostic
tests?
I. The test administration time is quite substantial.
II. The part scores have quite high intercorrelations.
III. Scores reflect sizable areas of deficiency, rather than pinpointing specific causes.
a. I only
b. III only.
c. I and II.
d. I and III.
e. I, II, and III.

e. 20. On a diagnostic test of mathematics for elementary school, one wants the standard error of
measurement, in terms of converted grade-equivalent scores, to be:
a. approximately the same over the total score range.
b. smallest for scores above the 75th percentile.
c. smallest for scores in the middle half of the distribution.
d. smallest for scores in the top half of the distribution.
e. smallest for scores below the 15th percentile.

c. 21. What is the major problem with criterion-referenced standardized achievement tests?
a. They emphasize failure.
b. They are not useful as a diagnostic tool.
c. They can only assess a restricted number of objectives.
d. They over-simplify decision making.
e. They are very complex to administer.

b. 22. What is the major limitation of standardized achievement tests?
a. They are too expensive.
b. They can only include objectives that are relevant to many school districts.
c. They cannot be criterion-referenced.
d. They do not have diagnostic capabilities.
e. Their results are presented in too complex a format for application to classroom problems.

d. 23. How does the number of items needed for a criterion-referenced test compare with the number
needed for a norm-referenced test? The criterion-referenced test should have:
a. fewer items.
b. the same number of items.
c. somewhat more items.
d. many more items.
e. the answer depends on which specific objectives are being assessed.
Chapter 14: Interests, Personality, and Adjustment

Chapter Outline

A. Introduction
B. Interest measurement
1. Strong Interest Inventory
a) Occupational scales
b) General occupational themes
c) Basic interest scales
d) Personal style scales
e) Administrative indexes
2. Career Assessment Inventory
3. Self-Directed Search
C. Personality and adjustment assessment
1. Theories of personality measurement
2. Dynamic approaches
a) Rorschach Inkblot test
b) Thematic Apperception Test
3. Trait approaches
a) Factor analytic exploration of traits
1) Sixteen Personality Factor Questionnaire
2) Revised NEO Personality Inventory
b) Empirical scale construction
1) Minnesota Multiphasic Personality Inventory
4. Humanistic approaches
5. Behavioral approaches
D. Problems with personality and interest measures
1. Response sets
2. Social desirability
E. Computerized scoring and interpretation
1. Advantages
2. Disadvantages
F. Summary


Study Questions and Answers

1. List two benefits of assessing interests.
a. Increase self-understanding
b. Aid in decision making

2. What is the difference between an interest test and a personality test?
An interest test provides information about the pattern of an individual's likes and dislikes.
Personality tests assess personal characteristics to paint a broad picture of what an individual is like.

3. What are the general purposes of measuring interests and personality?
Research, self-exploration, and clinical decision making.

4. What types of information are provided by the Strong Interest Inventory?
a. Occupational scale score: an indication of the match between examinee responses and those
of people in specific occupations
b. Occupational themes: different types of individuals and different characteristics of work
environments thought to underlie expressions of like and dislike for different occupations
c. Basic interests: more specific vocational themes
d. Personal style scales: preferences for broad styles of living and working
e. Administrative scales: statistical information about an examinee's responses that can be
used by the counselor to aid interpretation of the results

5. What is a major limitation of the Strong Interest Inventory?
It is primarily focused on professional occupations, making it less useful for those interested in
occupations that do not require completion of a college degree.

6. In what way is the Self-Directed Search different from other interest inventories?
Interpretation does not require the aid of a counselor.

7. What are the major theoretical approaches to personality assessment?
a. Dynamic theories that emphasize hidden or unconscious needs
b. Trait theories that posit the existence of stable internal characteristics
c. Humanistic theories that focus on how individuals view themselves
d. Behavioral theories that question the utility and even the existence of personality constructs.

8. What is the projective hypothesis?
The assumption that unconscious needs and concerns impact our perceptions and actions.

9. How is factor analysis used in personality assessment?
Factor analysis is used to analyze patterns of correlations among a large number of measures to
identify a smaller number of underlying traits, or factors, that account for the observed relationships.
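A minimal illustration (not from the text) of this idea in Python, using scikit-learn's FactorAnalysis; the data, sample size, and number of factors below are placeholders, not values from any actual inventory:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))        # hypothetical data: 200 examinees by 12 personality items

fa = FactorAnalysis(n_components=3)   # extract three underlying factors (candidate traits)
fa.fit(X)

# The loading matrix shows how strongly each observed item relates to each factor;
# inspecting these patterns is how trait labels (e.g., "extraversion") are assigned.
print(fa.components_.shape)           # (3, 12): 3 factors by 12 items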

10. What are the dimensions of self-concept assessed on the Multidimensional Self-Concept Scale?
Social, competence, affect, academic, family, physical

11. What view of personality is held by behavioral theorists?
Internal structures are not necessary to account for behavioral consistencies. Patterns of behavior
result from an individual's previous learning history.

12. What are the major sources of error on measures of personality and interests?
Response sets: the tendency to approach responses in a way that distorts the results
Social desirability: the tendency to respond in a manner the examinee believes is socially approved

13. How does the factor structure underlying tests of cognitive abilities differ from those found with
personality measures?
The factor structure of cognitive abilities tests is much simpler, making scores relatively easier to
understand and interpret.

14. What are the advantages of computer scoring and interpretation of interest and personality tests?
a. Time savings and convenience
b. Efficiency for the clinician
c. Standardization
d. Empirical predictions

15. What are the disadvantages of computer scoring and interpretation of interest and personality tests?
a. Loss of individualization
b. Documentation is limited
c. Questions of validity


Important Terms

a. Interest tests
b. Humanistic approach
c. Personality
d. Self-Directed Search
e. Occupational themes
f. Social desirability
g. Projective hypothesis
h. Empirical scale construction
i. Need for achievement
j. Adjustment
k. Response set


Match the description below with one of the terms listed above.

____1. Can be used for career exploration without the aid of a counselor.

____2. The assumption that core concerns impact all perceptions and actions.

____3. Procedure where a large number of items are given to a group of individuals with a known
characteristic.

____4. Tendency to respond to questions in a way one believes others will approve of.

____5. A constellation of characteristics unique to each individual.

____6. Six independent classifications of likes and dislikes.

____7. Lack of flexibility in answers given.

____8. Ability to adapt to the demands of day-to-day life.

____9. A trait that can be measured reliably by the Thematic Apperception Test.

____10. Personality as self-perception


Answers to Important Terms

1. d 4. f 7. k 10. b
2. g 5. c 8. j
3. h 6. e 9. i
Multiple Choice Questions

b. 1. Tests designed to provide information about a person's general likes and dislikes are called:
a. Personality tests.
b. Interest inventories.
c. Adjustment scales.
d. Projective tests.
e. Rating scales.

b. 2. The scores on the Strong Interest Inventory occupational profile correspond to:
a. Broad interest areas.
b. Specific professional jobs.
c. Specific skilled trades jobs.
d. A wide variety of jobs, including professional and skilled trades.
e. Personality traits.

d. 3. An examinee is described as introspective, independent, and intellectual. According to Holland, this
examinee is what type of person?
a. Realistic
b. Conventional
c. Enterprising
d. Investigative
e. Social

c. 4. An examinee is described as popular, social, and cheerful. According to Holland, this examinee is
what type of person?
a. Realistic
b. Conventional
c. Enterprising
d. Investigative
e. Social

a. 5. In which of the following situations would an assessment of adjustment be most useful?
a. As a first step in evaluating the problems of someone entering personal counseling.
b. In screening applications for employment.
c. In determining eligibility for special education services.
d. For studying the development of anger in a prison population.
e. As an indicator of school readiness.

c. 6. Procedures that have been developed to appraise personality are attempts to measure an individual's:
a. Knowledge of appropriate standards of conduct.
b. Social skills.
c. Typical patterns of behavior.
d. Reactions to stressful life experiences.
e. Behavior under optimal conditions.

d. 7. Which method of assessing personality requires the highest level of skill to interpret accurately?
a. Clinical observations
b. Self-report questionnaires
c. Behavioral checklists
d. Projective tests
e. Rating scales completed by others.

e. 8. If an interest inventory were included in a high-school guidance program, the most appropriate
use of the results by the counselor would be to:
a. Report the scores to the student and let him or her interpret them.
b. Study the scores and report to the student the fields or occupations for which he or she is best
suited.
c. Prepare a written report of the scores and send it to the student's parents.
d. Ignore the results since they are rarely valid predictors of future adjustment.
e. Use the scores as a basis for starting an interview exploring the student's interests and plans.

a. 9. A group of students was given an interest inventory when they were seniors in high school. If the
same group of students was retested during their junior year in college, the correlation between the
two sets of scores would be
a. Strongly positive (+.7 or higher)
b. Moderately positive (+.4 to +.6)
c. Weakly positive (+.1 to +.3)
d. Weakly negative ( - .1 to - .3)
e. Moderately negative (- .4 to - .6)

c. 10. A high score on the basic interest scale "teaching" aligns with which general occupational theme on
the Strong Interest Inventory?
a. Realistic
b. Investigative
c. Social
d. Enterprising
e. Conventional


e. 11. A high score on the basic interest scale "computer activities" aligns with which general
occupational theme on the Strong Interest Inventory?
a. Realistic
b. Investigative
c. Social
d. Enterprising
e. Conventional

c. 12. The basic assumption of projective methods of personality assessment is that the responses that an
individual makes to the stimulus materials depend primarily on the:
a. Nature of the stimulus presented.
b. Individual's previous experience with the stimuli.
c. Individual's inner personality structure.
d. Individual's present mood or feelings at the time of assessment.
e. Individual's level of comfort and rapport with the examiner.

a. 13. In developing an inventory such as the 16PF, where each score is based on a cluster of items that
are correlated and internally consistent, an attempt is made to get separate scores that will
a. Have low correlations with each other.
b. Be effective in identifying specific areas of maladjustment.
c. Have high predictive validity.
d. Be easy to interpret in non-technical terms.
e. Make subtle distinctions on a continuum of a trait.

c. 14. One main reason for NOT wanting scales on a personality inventory to have high intercorrelations
is that:
a. The resulting scale is inefficient to administer.
b. The reliability of individual scales will be low.
c. The differences between scores on such scales are less likely to be valid.
d. It encourages clinical interpretations of test results.

d. 15. One of the major limitations of the 16PF is
a. The low intercorrelation among items.
b. The selection of scales based only on theory.
c. The low reliability of the scales.
d. The lack of evidence connecting profiles with occupational groups.
e. The lack of empirical evidence underlying the scales.

d. 16. Which of the following is NOT a domain measured on the NEO Personality Inventory-Revised?
a. Extroversion
b. Conscientiousness
c. Agreeableness
d. Originality
e. Neuroticism

a. 17. Items are selected for inclusion on an empirically constructed scale based on their:
a. Ability to differentiate between groups.
b. Alignment with theory.
c. Face validity.
d. High internal consistency.
e. Resistance to the influence of respondent misrepresentations.

b. 18. A high K scale score on the Minnesota Multiphasic Personality Inventory (MMPI) would indicate
that the respondent
a. Was trying to portray him- or herself in a more positive light.
b. Displayed a pattern of scores similar to individuals who, while displaying otherwise normal
profiles, had been diagnosed with a psychiatric disorder.
c. Has a suppressed score as a result of being overly open about his or her personality
characteristics.
d. Has endorsed an unusually high number of rarely endorsed items.
e. Left a substantial number of items blank.

c. 19. A criminal who wishes to defend him or herself by claiming mental incompetence might be
detected on which of the following scales on the MMPI?
a. L scale
b. K scale
c. F scale
d. ? scale
e. P scale

a. 20. Under what circumstances would use of the MMPI be most defensible?
a. On a one-to-one basis with a trained clinician.
b. By an employer screening job applicants.
c. To determine eligibility for special education services.
d. By college admissions officers in choosing among prospective students.
e. As part of licensure and certification for counselors and teachers.

d. 21. Acquiescence is:
a. A personality characteristic measured on the NEO-PI-R.
b. Part of self-concept as measured by the Multidimensional Self-Concept Scale.
c. A validity scale on the MMPI.
d. A type of response set.
e. A characteristic of a realistic personality.

d. 22. The use of computerized scoring of personality assessments has expanded rapidly because it:
a. Enhances the validity of the interpretations.
b. Better incorporates the background of the respondent.
c. Eliminates the need for extensive clinical training of examiners.
d. Lessens the time-consuming process of scoring responses.
e. Makes it more difficult for examinees to misrepresent their characteristics.
