CHAPTER 1
>The roots of contemporary psychological testing and assessment can be found in early twentieth-century France.
In 1905, Alfred Binet and a colleague published a test designed to help place Paris schoolchildren in appropriate
classes.
>Testing was the term used to refer to everything from the administration of a test (as in “Testing in progress”) to
the interpretation of a test score.
-psychological assessment as the gathering and integration of psychology-related data for the purpose of
making a psychological evaluation that is accomplished through the use of tools such as tests, interviews, case
studies, behavioral observation, and specially designed apparatuses and measurement procedures.
APPROACH TO ASSESSMENT
1. therapeutic psychological assessment describes a collaborative approach to assessment with an element of
therapy as part of the process. Here, therapeutic self-discovery and new understandings are encouraged
throughout the assessment process.
2. dynamic assessment refers to an interactive approach to psychological assessment that usually follows
a model of (1) evaluation, (2) intervention of some sort, and (3) evaluation. It is most typically employed in
educational settings. Intervention between evaluations, sometimes even between individual questions
posed or tasks given, might take many different forms, depending upon the purpose of the dynamic
assessment (Haywood & Lidz, 2007). For example, an assessor may intervene in the course of an
evaluation of an assessee’s abilities with increasingly more explicit feedback or hints.
>test may be defined simply as a measuring device or procedure. When prefaced with a modifier, it refers to a
device or procedure designed to measure a variable related to that modifier.
>psychological test refers to a device or procedure designed to measure variables related to psychology. A
psychological test almost always involves analysis of a sample of behavior.
>behavior sample could be elicited by the stimulus of the test itself, or it could be naturally occurring behavior
(under observation).
A. content (subject matter): even two psychological tests purporting to measure the same thing (for example,
personality) may differ widely in item content.
B. format pertains to the form, plan, structure, arrangement, and layout of test items as well as to related
considerations such as time limits. Format is also used to refer to the form in which a test is administered:
computerized, pencil-and-paper, or some other form.
C. administration procedures (one-to-one basis; may require an active and knowledgeable test
administrator.)
D. scoring and interpretation procedures
>>score as a code or summary statement, usually but not necessarily numerical in nature, that reflects an
evaluation of performance on a test, task, interview, or some other sample of behavior.
>>Scoring is the process of assigning such evaluative codes or statements to performance on tests, tasks,
interviews, or other behavior samples.
.. cut score is a reference point, usually numerical, derived by judgment and used to divide a set of data into two
or more classifications (e.g., cut scores are used by employers as aids to decision making about personnel hiring
and advancement). Sometimes, no formal method is used to arrive at a cut score; some teachers use an informal
“eyeball” method.
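The cut-score idea above can be sketched in a few lines of Python; the threshold of 65 and the applicant scores are invented for illustration.

```python
# Hypothetical sketch of applying a cut score: the threshold (65) and the
# applicant scores below are invented for illustration.
CUT_SCORE = 65

def classify(score, cut=CUT_SCORE):
    """Divide scores into two classifications relative to the cut score."""
    return "pass" if score >= cut else "fail"

applicants = {"A": 72, "B": 58, "C": 65}
decisions = {name: classify(s) for name, s in applicants.items()}
```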
E. technical quality More commonly, reference is made to what is called the psychometric soundness of a
test. Synonymous with PSYCHOMETRICS.
>>psychometric soundness of a test refers to how consistently and how accurately a psychological test
measures what it purports to measure.
>>utility refers to the usefulness or practical value that a test or assessment technique has for a particular
purpose
TOOLS OF MEASUREMENT
1. interview conjures images of face-to-face talk, taking note of not only the content of what is said but also
the way it is being said (taking note of both verbal and nonverbal behavior). An interview is a method of
gathering information through direct communication involving reciprocal exchange.
>>interview may be used to help human resources professionals make more informed recommendations about
the hiring, firing, and advancement of personnel.
>>panel interview more than one interviewer participates in the personnel assessment.
An advantage of this approach, which has also been referred to as a board interview, is that any idiosyncratic
biases of a lone interviewer will be minimized by the use of two or more interviewers.
A disadvantage of the panel interview relates to its utility; the cost of using multiple interviewers may not be
justified, especially when the return on this investment is questionable.
2. Portfolio samples of one’s ability and accomplishment; individuals keep files of their work products (paper,
canvas, film, video, audio, or some other medium)
3. Case history data refers to records, transcripts, and other accounts in written, pictorial, or other form that
preserve archival information, official and informal accounts, and other data and items relevant to an
assessee
>>Work samples, artwork, doodlings, and accounts and pictures pertaining to interests and hobbies are yet other
examples.
>>provides information about neuropsychological functioning prior to the occurrence of a trauma or other event
that results in a deficit
>>case history data for insight into a student’s current academic or behavioral standing. Case history data is also
useful in making judgments concerning future class placements.
>>might shed light on how one individual’s personality interacted with a particular set of environmental conditions
4. Behavioral Observation monitoring the actions of others or oneself by visual or electronic means while
recording quantitative and/or qualitative information regarding the actions.
>>may be used as a tool to help identify people who demonstrate the abilities required to perform a particular task
or job
>>naturalistic observation venture outside of the confines of clinics, classrooms, workplaces, and research
laboratories in order to observe behavior of humans in a natural setting
5. Role play may be defined as acting an improvised or partially improvised part in a simulated situation. A
role-play test is a tool of assessment wherein assessees are directed to act as if they were in a particular
situation (example, grocery shopping skills). Assessees being evaluated for managerial or leadership ability
are routinely placed in role-play situations.
Computers as Tools
location, test-related data may be sent to and returned from this central facility by
>>extended scoring report, which includes statistical analyses of the testtaker’s performance
>>At the high end of interpretive reports is what is sometimes referred to as a consultative report. This type of
report, usually written in language appropriate for communication between assessment professionals, may
provide expert opinion concerning analysis of the data
>>integrative report will employ previously collected data (such as medication records or behavioral observation
data) into the test report; designed to integrate data from sources other than the test itself into the interpretive
report.
>>CAPA (computer assisted psychological assessment): the assistance computers provide to the test user, not
the testtaker.
>>CAT (computer adaptive testing): the computer’s ability to tailor the test to the testtaker’s ability or testtaking
pattern (example: on a computerized test of academic abilities, the computer might be programmed to switch from
testing math skills to English skills after three consecutive failures on math items.)
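The adaptive rule in the example above can be sketched as follows; the three-failure threshold comes from the example, while the function name and data layout are invented.

```python
# A minimal sketch (logic invented for illustration) of the adaptive rule
# described above: switch away from math items after three consecutive failures.
def find_switch_point(math_responses, max_failures=3):
    """Return the index of the math item at which the test would switch to
    English items, or None if the failure streak never reaches the limit.
    `math_responses` is a list of booleans (True = correct), in order."""
    streak = 0
    for i, correct in enumerate(math_responses):
        streak = 0 if correct else streak + 1
        if streak >= max_failures:
            return i
    return None
```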
-testtaker: anyone who is the subject of an assessment or an evaluation can be a testtaker or an assessee.
A psychological autopsy is the reconstruction of a deceased individual’s psychological profile on the basis of
archival records, artifacts, and interviews previously conducted with the deceased assessee or with people who
knew him or her.
-rapport may be defined as a working relationship between the examiner and the examinee
-alternate assessment is typically accomplished by means of some accommodation made to the assessee. In
the context of psychological testing and assessment, accommodation may be defined as the adaptation of a test,
procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an
assessee with exceptional needs.
-Alternate assessment is an evaluative or diagnostic procedure or process that varies from the usual,
customary, or standardized way a measurement is derived either by virtue of some special accommodation made
to the assessee or by means of alternative methods designed to measure the same variable(s).
Chapter 2
culture-specific tests: tests designed for use with people from one culture but not from another, in an effort to
“isolate” the cultural variable.
Verbal Communication
Standards of evaluation
What happens when groups systematically differ in terms of scores on a particular test? The answer, in a word, is
conflict.
>In vocational assessment, test users are sensitive to legal and ethical mandates concerning the use of tests
with regard to hiring, firing, and related decision making.
>affirmative action refers to voluntary and mandatory efforts undertaken by federal, state, and local
governments, private employers, and schools to combat discrimination and to promote equal opportunity in
education and employment for all. (Affirmative action seeks to create equal opportunity actively, not
passively.)
In assessment, one way of implementing affirmative action is by altering test-scoring procedures
according to set guidelines. For example, an individual’s score on a test could be revised according to the
individual’s group membership
>Laws are rules that individuals must obey for the good of the society as a whole—or rules thought to be for the
good of society as a whole.
>code of professional ethics is recognized and accepted by members of a profession; it defines the standard of
care expected of members of that profession
>>minimum competency testing programs: formal testing programs designed to be used in decisions regarding
various aspects of students’ education. The data from such programs were used in decision making about grade
promotions, awarding of diplomas, and identification of areas for remedial instruction.
>>Truth-in-testing legislation The primary objective of these laws was to provide testtakers with a means of
learning the criteria by which they are being judged.
>Litigation Rules governing citizens’ behavior stem not only from legislatures but also from interpretations of
existing law in the form of decisions handed down by courts.
Litigation has sometimes been referred to as “judge-made law” because it typically comes in the form of
a ruling by a court.
Test-user qualifications
Level A: Tests or aids that can adequately be administered, scored, and interpreted
with the aid of the manual and a general orientation to the kind of institution or organization in which one is
working (for instance, achievement or proficiency tests).
Level B: Tests or aids that require some technical knowledge of test construction and use and of supporting
psychological and educational fields such as statistics, individual differences, psychology of adjustment,
personnel psychology, and guidance (e.g., aptitude tests and adjustment inventories applicable to normal
populations).
Level C: Tests and aids that require substantial understanding of testing and supporting psychological fields
together with supervised experience in the use of these devices (for instance, projective tests, individual mental
tests).
(1) transforming the test into a form that can be taken by the testtaker, (2) transforming the responses of the
testtaker so that they are scorable, and (3) meaningfully interpreting the test data
The right of informed consent Testtakers have a right to know why they are being evaluated, how the test data
will be used, and what (if any) information will be released
to whom. (1) the general purpose of the testing, (2) the specific reason it is being undertaken in the present case,
and (3) the general type of instruments to be administered
>>(a) do not use deception unless it is absolutely necessary, (b) do not use deception at all if it will cause
participants emotional distress, and (c) fully debrief participants.
>>advised testers to keep information about test results superficial and focus only on “positive” findings
>>If the test results, findings, or recommendations made on the basis of test data are voided for any reason (such
as irregularities in the test administration), testtakers have a right to know that as well.
>>privacy right “recognizes the freedom of the individual to pick and choose for himself the time, circumstances,
and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behavior, and
opinions”
>>privileged information is protected by law from disclosure in a legal proceeding; clients may refuse to answer a
question put to them on the grounds that the answer might be self-incriminating
>>confidentiality concerns matters of communication outside the courtroom; privilege protects clients from
disclosure in judicial proceedings
>>In some rare instances, the psychologist may be ethically (if not legally) compelled to disclose information if
that information will prevent harm either to the client or to some endangered third party
>>The ruling in Jaffee affirmed that communications between a psychotherapist and a patient were privileged in
federal courts. The HIPAA privacy rule cited Jaffee and defined psychotherapy notes as “notes recorded (in any
medium) by a health care provider who is a mental health professional documenting or analyzing the contents of
conversation during a private counseling session or a group, joint, or family counseling session and that are
separated from the rest of the individual’s medical record.”
CHAPTER 3
A Statistics Refresher
SCALES OF MEASUREMENT
>measurement as the act of assigning numbers or symbols to characteristics of things (people, events,
whatever) according to rules.
>scale is a set of numbers (or other symbols) whose properties model empirical properties of the objects to which
the numbers are assigned.
>error refers to the collective influence of all of the factors on a test score or measurement beyond those
specifically measured by the test or measurement
1. Nominal scales are the simplest form of measurement. These scales involve classification or categorization
based on one or more distinguishing characteristics
2. Ordinal scales permit classification as well as rank ordering on some characteristic
>>Alfred Binet, a developer of the intelligence test that today bears his name, believed strongly that the data
derived from an intelligence test are ordinal in nature.
>>Rokeach Value Survey (uses ordinal form of measurement): a list of personal values—such as freedom,
happiness, and wisdom—are put in order according to their perceived importance to the testtaker
>>“Intelligence, aptitude, and personality test scores are, basically and strictly speaking, ordinal.”
3. interval scales contain equal intervals between numbers. Each unit on the scale is exactly equal to any other
unit on the scale.
>>With interval scales, we have reached a level of measurement at which it is possible to average a set of
measurements and obtain a meaningful result.
4. ratio scale has a true zero point. All mathematical operations can meaningfully be performed because there
exist equal intervals between the numbers on the scale as well as a true or absolute zero point.
DESCRIBING DATA
>>A distribution may be defined as a set of test scores arrayed for recording or study
>>raw score is a straightforward, unmodified accounting of performance that is usually numerical; a simple tally, as
in the number of items responded to correctly on an achievement test
>>frequency distribution: all scores are listed alongside the number of times each score occurred. The scores
might be listed in tabular or graphic form
>>simple frequency distribution to indicate that individual scores have been used and the data have not been
grouped (e.g., 85, 89, 87, 90)
>>grouped frequency distribution. In a grouped frequency distribution, test-score intervals, also called class
intervals, replace the actual test scores
>>graph is a diagram or chart composed of lines, points, bars, or other symbols that describe and illustrate data.
1. histogram is a graph with vertical lines drawn at the true limits of each test score (or class interval),
forming a series of contiguous rectangles.
3. frequency polygon is expressed by a continuous line connecting the points where test scores or class
intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis).
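The simple and grouped frequency distributions described above can be sketched with the standard library; the scores and the interval width of 5 are invented for illustration.

```python
# A minimal sketch (data invented) of simple and grouped frequency
# distributions built from raw scores, using only the standard library.
from collections import Counter

raw_scores = [85, 89, 87, 90, 85, 87, 87, 92]

# Simple frequency distribution: each individual score with its frequency.
simple_freq = Counter(raw_scores)

# Grouped frequency distribution: class intervals of width 5 replace raw scores.
def class_interval(score, width=5):
    lower = (score // width) * width
    return (lower, lower + width - 1)

grouped_freq = Counter(class_interval(s) for s in raw_scores)
```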
>>measure of central tendency is a statistic that indicates the average or midmost score
>>In special instances, such as when there are only a few scores and one or two of the scores are extreme in
relation to the remaining ones, a measure of central tendency other than the mean may be desirable
1. arithmetic mean (average) is equal to the sum of the observations (or test scores in this case) divided by
the number of observations
.. appropriate for interval or ratio data when the distributions are believed to be approximately normal
.. The median, the middle score in a distribution, is a statistic that takes into account the order of scores and is
itself ordinal in nature.
.. In a bimodal distribution there are two scores that occur with the highest frequency.
.. The mode is most appropriate for nominal data
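The three measures of central tendency above can be computed with the standard library; the scores are invented, with one extreme value to show why the median may be preferred over the mean.

```python
# A brief sketch of the measures of central tendency described above; the
# scores are invented, and 100 is deliberately an extreme (outlying) score.
import statistics

scores = [2, 3, 3, 4, 5, 5, 5, 100]

mean = statistics.mean(scores)      # pulled upward by the extreme score
median = statistics.median(scores)  # ordinal; insensitive to the outlier's size
mode = statistics.mode(scores)      # the most frequently occurring score
```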
MEASURES OF VARIABILITY
.. include the range, the interquartile range, the semi-interquartile range, the average deviation, the standard
deviation, and the variance.
1. range of a distribution is equal to the difference between the highest and the lowest scores
2. quartiles are the dividing points between the four quarters in the distribution (quartile refers to a specific
point, whereas quarter refers to an interval)
4. standard deviation is a measure of variability equal to the square root of the average squared
deviations about the mean (we use the square of each deviation score.)
.. variance is equal to the arithmetic mean of the squares of the differences between the scores in a
distribution and their mean.
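The range, variance, and standard deviation defined above can be computed directly; the scores below are invented for illustration.

```python
# A minimal sketch of range, variance, and standard deviation as defined above.
import statistics

scores = [4, 6, 8, 10, 12]

score_range = max(scores) - min(scores)   # highest minus lowest
variance = statistics.pvariance(scores)   # mean of squared deviations about the mean
std_dev = statistics.pstdev(scores)       # square root of the variance
```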
skewness, or the nature and extent to which symmetry is absent. Skewness is an indication of how the
measurements in a distribution are distributed.
a. positive skew: when relatively few of the scores fall at the high end of the distribution; may indicate that
the test was too difficult
b. negative skew: when relatively few of the scores fall at the low end of the distribution; may indicate that
the test was too easy
CHAPTER 4
>>trait has been defined as “any distinguishable, relatively enduring way in which one individual varies from
another”
>>For our purposes, a psychological trait exists only as a construct—an informed, scientific concept developed or
constructed to describe or explain behavior.
>>We can’t see, hear, or touch constructs, but we can infer their existence from overt behavior
.. overt behavior refers to an observable action or the product of an observable action, including test- or
assessment-related responses.
>>cumulative scoring. Inherent in cumulative scoring is the assumption that the more the testtaker responds in a
particular direction as keyed by the test manual as correct or consistent with a particular trait, the higher that
testtaker is presumed to be on the targeted ability or trait
>>domain sampling, which may refer to either (1) a sample of behaviors from all possible behaviors that could
conceivably be indicative of a particular construct or (2) a sample of test items from all possible items that could
conceivably be used to measure a particular construct.
>>In some forensic (legal) matters, psychological tests may be used not to predict behavior but to postdict it—that
is, to aid in the understanding of behavior that has already taken place
Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
>>error traditionally refers to something that is more than expected; it is actually a component of the
measurement process. Error refers to a long-standing assumption that factors other than what a test
attempts to measure will influence performance on the test.
>>error variance, that is, the component of a test score attributable to sources other than the trait or ability
measured.
>>classical or true score theory of measurement: an assumption is made that each testtaker has a true score on a
test that would be obtained but for the random action of measurement error
Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
>>reliability involves the consistency of the measuring tool: it yields the same numerical measurement every time it
measures the same thing under the same conditions.
>>norms compare the performance of the testtaker with the performance of other testtakers. Norms are the test
performance data of a particular group of testtakers that are designed for use as a reference when evaluating or
interpreting individual test scores.
1. race norming is the controversial practice of norming on the basis of race or ethnic background
2. user norms or program norms, which “consist of descriptive statistics based on a group of testtakers
in a given period of time rather than norms obtained by formal sampling methods”
Types of Norms
1. Percentile norms are the raw data from a test’s standardization sample converted to percentile form.
>>Percentage correct refers to the distribution of raw scores—more specifically, to the number of items that
were answered correctly multiplied by 100 and divided by the total number of items.
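The percentage-correct formula above, written out as a small function:

```python
# The percentage-correct formula stated above, written out directly.
def percentage_correct(num_correct, total_items):
    """Number of items answered correctly, multiplied by 100, divided by total."""
    return num_correct * 100 / total_items
```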
2. Age norms Also known as age-equivalent scores, age norms indicate the average performance of different
samples of testtakers who were at various ages at the time the test was administered
3. Grade norms Designed to indicate the average test performance of testtakers in a given school grade; a
convenient, readily understandable gauge of how one student’s performance compares with that of fellow
students in the same grade.
>>>>Both grade norms and age norms are referred to more generally as developmental norms, a term applied
broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to
develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life.
4. national norms are derived from a normative sample that was nationally representative of the population at
the time the norming study was conducted
5. National anchor norms provide some stability to test scores by anchoring them to other test scores.
6. Subgroup norms
7. Local norms Typically developed by test users themselves; provide normative information with respect to the
local population’s performance on some test.
>>normative sample is that group of people whose performance on a particular test is analyzed for reference in
evaluating the performance of individual testtakers; designed for use as a reference when evaluating or
interpreting individual test scores
>>normative data provide a standard with which the results of measurement can be compared
A. norm-referenced testing and assessment as a method of evaluation and a way of deriving meaning from
test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of testtakers
... a common goal of norm-referenced tests is to yield information on a testtaker’s standing or ranking relative to
some comparison group of testtakers
B. Criterion-referenced testing and assessment may be defined as a method of evaluation and a way of deriving
meaning from test scores by evaluating an individual’s score with reference to a set standard
... Because criterion-referenced tests are frequently used to gauge achievement or mastery, they are sometimes
referred to as mastery tests. The criterion-referenced approach has enjoyed widespread acceptance in the field of
computer-assisted education programs. In such programs, mastery of segments of materials is assessed before
the program user can proceed to the next level...
“Content-referenced interpretations are those where the score is directly interpreted in terms of performance at
each point on the achievement continuum being measured. Criterion-referenced interpretations are those where
the score is directly interpreted in terms of performance at any given point on the continuum of an external
variable. An external criterion variable might be grade averages or levels of job performance.”
2. stratified-random sampling: every member of the population has the same chance of being included in the
sample
... Manufacturers of products frequently use purposive sampling when they test the appeal of a new product in
one city or market and then make assumptions about how that product would sell nationally
4. incidental sample or a convenience sample: an important distinction between what is ideal and what is
practical in sampling
... A test is said to be standardized when it has clearly specified procedures for administration and scoring,
typically including normative data
>>Standards. “the standard against which all similar tests are judged.”
>>Standard error of the difference A statistic used to estimate how large a difference between two scores should
be before the difference is considered statistically significant
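One commonly given formula for this statistic, assuming both scores come from tests reported on the same scale, is SE_diff = sd * sqrt(2 - r1 - r2); the numbers below are invented for illustration.

```python
# A hedged sketch of one common formula for the standard error of the
# difference between two scores, assuming both tests are on the same scale:
#     SE_diff = sd * sqrt(2 - r1 - r2)
# where sd is the scale's standard deviation and r1, r2 are the two tests'
# reliability coefficients. Values below are invented for illustration.
import math

def se_difference(sd, r1, r2):
    return sd * math.sqrt(2 - r1 - r2)

se = se_difference(15, 0.90, 0.90)   # on an IQ-like scale with sd = 15
```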
>>fixed reference group scoring system: the distribution of scores obtained on the test from one group of
testtakers is used as the basis for the calculation of test scores for future administrations of the test (e.g., the SAT)
Central to psychological testing and assessment are inferences (deduced conclusions) about how some things
(such as traits, abilities, or interests) are related to other things (such as behavior).
>>correlation is an expression of the degree and direction of correspondence between two things
>>coefficient of correlation (r) expresses a linear relationship between two (and only two) variables, usually
continuous in nature
.. positively (or directly) correlated if two variables simultaneously increase or simultaneously decrease
.. negative (or inverse) correlation occurs when one variable increases while the other variable decreases
.. If a correlation is zero, then absolutely no relationship exists between the two variables.
>coefficient of determination is an indication of how much variance is shared by the X- and the Y-variables.
>Spearman’s rho (the rank-order correlation coefficient) is frequently used when the sample size is small and
especially when both sets of measurements are in ordinal (or rank-order) form.
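A self-contained sketch of the Pearson coefficient of correlation (r) and the coefficient of determination (r squared); the data are invented so the relationship is perfectly linear.

```python
# A minimal sketch of Pearson's r and the coefficient of determination (r^2).
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # perfectly (positively) correlated with x
r = pearson_r(x, y)
r_squared = r ** 2            # proportion of variance shared by X and Y
```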
a. scatterplot is a simple graphing of the coordinate points for values of the X-variable and the Y-variable
... curvilinearity in this context refers to an “eyeball gauge” of how curved a graph is.
>>outlier is an extremely atypical point located at a relatively long distance from the rest of the coordinate points
in a scatterplot. In some cases, outliers are simply the result of administering a test to a very small sample of
testtakers
3. Regression defined broadly as the analysis of relationships among variables for the purpose of
understanding how one variable may predict another
A. Simple regression involves one independent variable (X), typically referred to as the predictor
variable, and one dependent variable (Y), typically referred to as the outcome variable
... The regression line is the line of best fit: the straight line that, in one sense, comes closest to the greatest
number of points on the scatterplot of X and Y.
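Simple regression and its line of best fit can be sketched with the least-squares formulas; the data are invented to lie exactly on a line so the fitted values are easy to check.

```python
# A minimal sketch of simple least-squares regression, Y' = a + bX:
# the slope b and intercept a of the line of best fit described above.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # data lie exactly on Y = 1 + 2X
```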
... The multiple regression equation takes into account the intercorrelations among all the variables involved.
Correlation, regression, and multiple regression are all statistical tools used to help describe relationships
between variables.
1. meta-analysis refers to a family of techniques used to statistically combine information across studies to
produce single estimates of the statistics being studied
A key advantage of meta-analysis over simply reporting a range of findings is that, in meta-analysis, more weight
can be given to studies that have larger numbers of subjects.
CHAPTER 5
Reliability
>>reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score
variance on a test and the total variance.
>>classical test theory holds that a score on an ability test is presumed to reflect not only the testtaker’s true score
on the ability being measured but also error
>>variance (σ²)—the standard deviation squared. This statistic is useful because it can be broken into
components. Variance from true differences is true variance, and variance from irrelevant, random sources is
error variance
Let’s emphasize here that a systematic source of error would not affect score consistency.
A. Test construction
1. item sampling or content sampling (variation found in the way the items are worded and in the exact
content sampled.)
The goal is to maximize the proportion of the total variance that is true variance and to minimize the
proportion of the total variance that is error variance.
B. Test administration
1. test environment: the room temperature, the level of lighting, and the amount of ventilation and noise
>>transient error, a source of error attributable to variations in the testtaker’s feelings, moods, or mental state
over time.
RELIABILITY ESTIMATES
1. Test-Retest Reliability: using the same instrument to measure the same thing at two points in time.
>>is an estimate of reliability obtained by correlating pairs of scores from the same people on two different
administrations of the same test.
>>appropriate when evaluating a test that purports to measure something that is relatively stable over time (e.g.,
a personality trait)
>>coefficient of stability: the estimate of test-retest reliability when the interval between testings is greater than
six months. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
>>test-retest reliability may be most appropriate for a test that measures reaction time or perceptual judgments
(including discriminations of brightness, loudness, or taste).
>>Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test
scores are equal
>>Alternate forms are simply different versions of a test that have been constructed so as to be parallel. In
alternate forms, a reliability estimate is based on the correlation between the two total scores on the two forms
>>An estimate of reliability can also be obtained without developing an alternate form of the test and without
having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the
internal consistency of the test items
3.Split-Half Reliability
>>split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a
single test administered once
>>The reliability estimate is based on the correlation between scores on two halves of the test
>>It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to
administer a test twice
>>Simply dividing the test in the middle is not recommended because it’s likely this procedure would
spuriously raise or lower the reliability coefficient
>>In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability
estimate is to create what might be called “mini-parallel-forms,” with each half equal to the other in format,
stylistic, statistical, and related aspects
>>The Spearman-Brown formula allows a test developer or user to estimate internal consistency reliability from a
correlation of two halves of a test
>>Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a
test that has been shortened or lengthened
>>Spearman-Brown estimates are based on a test that is twice as long as the original half test
>>A Spearman-Brown formula could also be used to determine the number of items needed to attain a desired
level of reliability
>>Spearman-Brown estimates are inappropriate for measuring the reliability of heterogeneous tests and speed tests
…Inter-item consistency refers to the degree of correlation among all the items on a scale. It is calculated
from a single administration of a single form of a test. An index of inter-item consistency is useful in
assessing the homogeneity of the test.
.. homogeneity = unifactorial, measures a single trait. The more homogeneous a test is, the more inter-item
consistency it can be expected to have
>>Kuder-Richardson formula 20 (KR-20), so named because it was the twentieth formula developed in a series
>>KR-21 (equal difficulty) formula may be used if there is reason to assume that all the test items have
approximately the same degree of difficulty. (Now outdated.)
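A sketch of KR-20 for dichotomously (0/1) scored items, using the usual form KR-20 = (k/(k-1)) * (1 - sum(p*q)/variance of total scores); the item data in the test are invented for illustration.

```python
# A sketch of Kuder-Richardson formula 20 for dichotomously scored items:
#   KR-20 = (k / (k - 1)) * (1 - sum(p*q) / total_variance)
# where p is the proportion passing each item, q = 1 - p, and the variance
# is that of testtakers' total scores.
import statistics

def kr20(item_matrix):
    """item_matrix: rows = testtakers, columns = items scored 0/1."""
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    var_total = statistics.pvariance(totals)
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / len(item_matrix)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)
```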
Coefficient alpha
>>this formula yields an estimate of the mean of all possible split-half coefficients
>>Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0
to 1
…coefficient alpha is calculated to help answer questions about how similar sets of data are, on a scale from 0
(absolutely no similarity) to 1 (perfectly identical)
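Coefficient alpha can be computed directly from the standard formula alpha = (k/(k−1)) × (1 − Σ item variances / total-score variance). The sketch below is illustrative; the function name and data are hypothetical, not from the notes.

```python
# Minimal sketch of Cronbach's coefficient alpha from raw item scores.
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list of item scores per testtaker (equal lengths).
    alpha = (k / (k - 1)) * (1 - sum(item variances) / total variance)."""
    k = len(item_scores[0])                          # number of items
    totals = [sum(person) for person in item_scores]
    item_vars = [statistics.pvariance([p[i] for p in item_scores])
                 for i in range(k)]
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

When items covary strongly (testtakers respond consistently across items), the total-score variance outgrows the summed item variances and alpha approaches 1.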
inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or
raters) with regard to a particular measure.
>>Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to
calculate a coefficient of correlation (coefficient of inter-scorer reliability)
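A coefficient of inter-scorer reliability is just a correlation between two raters' scores for the same set of responses. A minimal sketch, with hypothetical ratings and a hypothetical function name:

```python
# Minimal sketch of a coefficient of inter-scorer reliability: the Pearson
# correlation between two raters' scores for the same essays.
import statistics

def interscorer_r(rater_a, rater_b):
    """Pearson r between two raters' score lists (paired by essay)."""
    ma, mb = statistics.mean(rater_a), statistics.mean(rater_b)
    num = sum((a - ma) * (b - mb) for a, b in zip(rater_a, rater_b))
    den = (sum((a - ma) ** 2 for a in rater_a) *
           sum((b - mb) ** 2 for b in rater_b)) ** 0.5
    return num / den
```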
>> a dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational
and cognitive experiences
>> Contrast this situation to one in which hourly assessments of this same stockbroker are made on a trait, state,
or ability presumed to be relatively unchanging (a static characteristic), such as intelligence
>>restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance) is
important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used,
then the resulting correlation coefficient tends to be lower.
>> power test: the time limit is long enough to allow testtakers to attempt all items, but some items are so difficult that no testtaker is expected to obtain a perfect score
>> speed test: items of uniform level of difficulty; with generous time limits, all testtakers should be able to complete
all the test items correctly (scores are based on performance speed)
Criterion-referenced tests
>> a criterion-referenced test is designed to provide an indication of where a testtaker stands with respect to some
variable or criterion, such as an educational or a vocational objective.
>> Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in
hierarchical fashion
>>seek to estimate the extent to which specific sources of variation under defined conditions are contributing
to the test score.
…In domain sampling theory, a test's reliability is conceived of as an objective measure of how precisely the test
score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the
universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct:
one that shares certain characteristics with (and is measured
by) the sample of items that make up the test. In theory, the items in the domain are thought to have the same
means and variances as those in the test that samples from the domain. Of the three types of estimates of
reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.
>> may be viewed as an extension of true score theory wherein the concept of a universe score replaces that of
a true score
…generalizability theory is based on the idea that a person’s test scores vary from testing to testing
because of variables in the testing situation Instead of conceiving of all variability in a person’s scores as error,
Cronbach encocraged test developers and researchers to describe the details of the particclar test sitcation or
cniverse leading to a specifi c test score. This cniverse is described in terms of its facets (e.g ncmber of items in
the test, amocnt of training the test scorers etc). This test score is the universe score
…generalizability stcdy examines how generalizable scores from a particclar test are if the test is administered in
different sitcations. generalizability stcdy examines how mcch of an impact different facets of the cniverse have
on the test score.
…coefficients of generalizability (coefficients are similar to reliability coefficients in the trce score model)
>>After the generalizability stcdy is done developers do a decision stcdy. (decision stcdy, developers examine the
csefclness of test scores in helping the test cser make decisions)
Generalizability has not replaced the true score model. From the perspective of generalizability theory, a
test’s reliability is very much a function of the circumstances under which the test is developed,
administered, and interpreted.
>>Item response theory (IRT) procedures provide a way to model the probability that a person with X ability will be able
to perform at a level of Y (this can also be stated in terms of personality assessment)
…Because the psychological or educational construct being measured is so often physically unobservable (stated
another way, is latent) and because the construct being measured may be a trait (it could also be something else,
such as an ability), a synonym for IRT in the academic literature is latent-trait theory
1. "Difficulty" in this sense refers to the attribute of not being easily accomplished, solved, or
comprehended. (For example, a test item tapping basic addition ability will have a lower difficulty level than a
test item tapping basic algebra skills)
2. discrimination signifies the degree to which an item differentiates among people with higher or lower
levels of the variable (trait, ability) being measured
…and item 5, "I am able to walk one block on flat ground." If you were developing this questionnaire within an IRT
framework, you would probably assign differential weight to the value of these two items. Item 5 would be given
more weight for the purpose of estimating a person's level of physical activity than item 4. Again, within the
context of classical test theory, all items of the test might be given equal weight and scored, for example, 1 if
indicative of the ability being measured and 0 if not indicative of that ability.
IRT models
A. dichotomous test items (test items or questions that can be answered with only one of two alternative
responses, such as true–false, yes–no, or correct–incorrect questions).
B. polytomous test items (test items or questions with three or more alternative responses, where only one
is scored correct or scored as being consistent with a targeted trait or other construct).
>>In the Rasch model, each item on the test is assumed to have an equivalent relationship with the construct being
measured by the test.
>>The probabilistic relationship between a testtaker's response to a test item and that testtaker's level on the
latent construct being measured by the test is expressed in graphic form by what has been variously referred to
as an item characteristic curve (ICC)
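The curve an ICC traces can be sketched with the standard Rasch-model response function, in which the probability of a correct response depends only on the gap between person ability (theta) and item difficulty (b). The function name and values below are hypothetical.

```python
# Minimal sketch of a Rasch-model item characteristic curve.
import math

def rasch_probability(theta, b):
    """P(correct) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Tracing one item (b = 0) across ability levels sketches its ICC:
# at theta == b the probability is exactly .50, rising toward 1 for
# higher-ability testtakers and falling toward 0 for lower-ability ones.
```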
>>the information function provides insight into which items work best with testtakers at a particular theta level as
compared to other items on the test
>>An item information curve can be a very useful tool for test developers. It is used, for example, to reduce the
total number of test items in a "long form" of a test and so create a new and effective "short form."
>>An item information curve can also be useful in terms of raising "red flags" regarding test items that are
particularly low in information
>>the standard error of measurement, often abbreviated as SEM or SE_M, provides a measure of the precision of an observed test score (it
provides an estimate of the amount of error inherent in an observed score or measurement)
>>the relationship between the SEM and the reliability of a test is inverse (the higher the reliability of a test (or
individual subtest within a test), the lower the SEM)
>>the standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score
>>the standard error of measurement is an index of the extent to which one individual's scores vary over
tests presumed to be parallel
…Because the standard error of measurement functions like a standard deviation in this context, we can use it to
predict what would happen if an individual took additional equivalent tests:
■ approximately 68% (actually, 68.26%) of the scores would be expected to occur within ±1 SEM of the true
score;
■ approximately 95% (actually, 95.44%) of the scores would be expected to occur within ±2 SEM of the true
score;
■ approximately 99% (actually, 99.74%) of the scores would be expected to occur within ±3 SEM of the true
score.
>>confidence interval: a range or band of test scores that is likely to contain the true score
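The SEM and a confidence interval built from it can be sketched using the standard formula SEM = SD × sqrt(1 − reliability). The function names, score, SD, and reliability values below are hypothetical.

```python
# Minimal sketch of the standard error of measurement and a confidence
# interval around an observed score.
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(observed, sd, reliability, z=1.96):
    """Band of +/- z SEMs around the observed score (95% when z = 1.96)."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return observed - margin, observed + margin

# With SD = 15 and reliability = .91, SEM = 15 * sqrt(.09) = 4.5, so an
# observed score of 100 gives a 95% band of roughly 91.2 to 108.8.
```

This also makes the inverse SEM-reliability relationship concrete: raising reliability toward 1 shrinks sqrt(1 − reliability), and the band around the observed score tightens.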
The standard error of the difference is a statistical measure that can aid a test user in determining how large a
difference between two scores should be before it is considered statistically significant
CHAPTER 6
Validity
>> validity is a term used in conjunction with the meaningfulness of a test score—what the test score truly means
>> validity, applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in
a particular context
>> it is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
>> (validation studies) It is the test developer's responsibility to supply validity evidence in the test manual. It may
sometimes be appropriate for test users to conduct their own validation studies
>> Local validation studies are absolutely necessary when the test user plans to alter in some way the format,
instructions, language, or content of the test
1. content validity
2. criterion-related validity
3. construct validity
>> Face validity relates more to what a test appears to measure to the person being tested than to what the test
actually measures
…A test's lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test—
with a consequential decrease in the testtaker's cooperation or motivation to do his or her best
Content Validity
>>describes a judgment of how adequately a test samples behavior representative of the universe of behavior
that the test was designed to sample (For example, the universe of behavior referred to as assertive is very wide-
ranging)
…a test blueprint is a plan regarding the types of information to be covered by the items, the number of items tapping
each area of coverage, the organization of the items in the test, and so forth (it represents the culmination of these efforts)
…For an employment test to be content-valid, its content must be a representative sample of the job-related
skills required for employment
…Behavioral observation is one technique frequently used in blueprinting the content areas to be covered in
certain types of employment tests
..Lawshe recommended that if the amount of agreement observed is more than 5% likely to occur by chance,
then the item should be eliminated.
Criterion-related validity
>>is a judgment of how adequately a test score can be used to infer an individual's most probable standing on
some measure of interest
…For example, if a test purports to measure the trait of athleticism, we might expect to employ "membership in a
health club" or any generally accepted measure of physical fitness as a criterion in evaluating whether the
athleticism test truly measures athleticism
1. Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained
at the same time
2. Predictive validity is an index of the degree to which a test score predicts some criterion measure (e.g.,
measures of the relationship between college admissions tests and freshman grade point averages)
>>Criterion contamination is the term applied to a criterion measure that has been based, at least in part, on
predictor measures. (e.g., the predictor is the MMPI-2-RF, and the criterion is the psychiatric diagnosis that exists in
the patient's record; someone informs these researchers that the diagnosis for every patient in the Minnesota state
hospital system was determined, at least in part, by an MMPI-2-RF test score. Should they still proceed with their
analysis? The answer is no. Because the predictor measure has contaminated the criterion measure, it would be
of little value to find, in essence, that the predictor can indeed predict itself)
…Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical
evidence: the validity coefficient and expectancy data.
>>The validity coefficient is a correlation coefficient that provides a measure of the relationship between test
scores and scores on the criterion measure
>>Incremental validity (multiple predictors) is defined here as the degree to which an additional predictor
explains something about the criterion measure that is not explained by predictors already in use
>>Expectancy data provide information that can be used in evaluating the criterion-related validity of a test.
Using a score obtained on some test(s) or measure(s), expectancy tables illustrate the likelihood that the testtaker
will score within some interval of scores on a criterion measure—an interval that may be seen as "passing,"
"acceptable," and so on
…Taylor-Russell tables provide an estimate of the extent to which inclusion of a particular test in the selection
system will actually improve selection. More specifically, the tables provide an estimate of the percentage of
employees hired by the use of a particular test who will be successful at their jobs, given different combinations of
three variables: the test's validity, the selection ratio used, and the base rate.
""One limitation of the Taylor-Russell tables is that the relationship between the predictor (the test) and
the criterion (rating of performance on the job) must be linear
…the selection ratio is a numerical value that reflects the relationship between the number of people to be hired and
the number of people available to be hired (e.g., 50 positions and 100 applicants yields a selection ratio of .50)
…the base rate refers to the percentage of people hired under the existing system for a particular position.
…the Naylor-Shine tables entail obtaining the difference between the means of the selected and unselected groups
to derive an index of what the test adds (they provide an indication of the difference in average criterion scores for the
selected group as compared with the original group)
1. the base rate is the extent to which a particular trait, behavior, characteristic, or attribute exists in the population
(expressed as a proportion)
2. the hit rate may be defined as the proportion of people a test accurately identifies as possessing or exhibiting a
particular trait, behavior, characteristic, or attribute (a hit rate could refer to the proportion of people accurately
predicted to be able to perform work at the graduate school level or to the proportion of neurological patients
accurately identified as having a brain tumor)
3. the miss rate may be defined as the proportion of people the test fails to identify as having, or not having, a
particular characteristic or attribute (a miss amounts to an inaccurate prediction)
A. a false positive is a miss wherein the test predicted that the testtaker did possess the particular
characteristic or attribute being measured when in fact the testtaker did not
B. a false negative is a miss wherein the test predicted that the testtaker did not possess the particular
characteristic or attribute being measured when the testtaker actually did
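Hit, miss, false-positive, and false-negative rates can be tallied directly from prediction/outcome pairs. A minimal sketch; the function name and data are hypothetical.

```python
# Minimal sketch: classification rates from (predicted, actual) pairs.

def classification_rates(pairs):
    """pairs: list of (predicted_positive, actually_positive) booleans."""
    n = len(pairs)
    hits = sum(1 for pred, actual in pairs if pred == actual)
    false_pos = sum(1 for pred, actual in pairs if pred and not actual)
    false_neg = sum(1 for pred, actual in pairs if not pred and actual)
    return {
        "hit_rate": hits / n,                   # accurate identifications
        "false_positive_rate": false_pos / n,   # predicted yes, actually no
        "false_negative_rate": false_neg / n,   # predicted no, actually yes
        "miss_rate": (false_pos + false_neg) / n,
    }
```

Note that hit rate and miss rate sum to 1: every case is either classified accurately or is a miss of one of the two kinds.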
Construct Validity
>>is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on
a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or
explain behavior
…Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test
behavior or criterion performance.
>>method of contrasted groups: one way of providing evidence for the validity of a test is to demonstrate that
scores on the test vary in a predictable way as a function of membership in some group.
>>convergent evidence comes from other tests or measures designed to assess the same (or a similar) construct.
>>Evidence of homogeneity. Homogeneity refers to how uniform a test is in measuring a single concept
.The Pearson r could be used to correlate average subtest scores with the average total test score. (Subtests
that in the test developer's judgment do not correlate very well with the test as a whole might have to be
reconstructed (or eliminated) lest the test not measure the construct academic achievement.)
.For tests scored dichotomously, homogeneity can be improved by eliminating items that do not show significant
correlation coefficients with total test scores.
.The homogeneity of a test in which items are scored on a multipoint scale can also be improved (For example,
some attitude and opinion questionnaires require respondents to indicate level of agreement with specific
statements by responding, for example, strongly agree, agree, disagree, or strongly disagree. Each response is
assigned a numerical score, and items that do not show significant Spearman rank-order correlation
coefficients are eliminated.)
.Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items
>>Evidence of changes with age: some constructs are expected to change over time (e.g., reading rate)
>>Evidence of pretest–posttest changes: evidence that test scores change as a result of some experience
between a pretest and a posttest can be evidence of construct validity.
>>Discriminant evidence
>>Factor analysis: both convergent and discriminant evidence of construct validity can be obtained by the use of
factor analysis.
…Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors or
specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
A. Exploratory factor analysis typically entails "estimating, or extracting factors; deciding how many factors
to retain; and rotating factors to an interpretable orientation"
B. In confirmatory factor analysis, "a factor structure is explicitly hypothesized and is tested for its fit with the
observed covariance structure of the measured variables"
>>The factor loading in a test conveys information about the extent to which the factor determines the test score or
scores
>>For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial
measurement
..Intercept bias is a term derived from the point where the regression line intersects the Y-axis (for example,
a test systematically underpredicts or overpredicts the performance of members of a particular group (such as
people with green eyes) with respect to a criterion (such as supervisory rating))
..Slope bias occurs if a test systematically yields significantly different validity coefficients for members of different
groups
..The DAS (Differential Ability Scales) is designed to measure school-related ability and achievement in children
and adolescents.
Rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale.
1. leniency error (also known as a generosity error): an error in rating that arises from the tendency on the
part of the rater to be lenient in scoring, marking, and/or grading.
2. severity error
3. central tendency error: here the rater, for whatever reason, exhibits a general and systematic reluctance
to giving ratings at either the positive or the negative extreme. Consequently, all of this rater's ratings
would tend to cluster in the middle of the rating continuum.
..One way to overcome what might be termed restriction-of-range rating errors (central tendency, leniency,
severity) is to use rankings, a procedure that requires the rater to measure individuals against one another instead
of against an absolute scale
By using rankings instead of ratings, the rater (now the "ranker") is forced to select first, second, third
choices, and so forth.
The halo effect describes the fact that, for some raters, some ratees can do no wrong. It is defined as a tendency to give
a particular ratee a higher rating than he or she objectively deserves because of the rater's failure to discriminate
among conceptually distinct and potentially independent aspects of a ratee's behavior.
Test Fairness
>>the extent to which a test is used in an impartial, just, and equitable way
CHAPTER 7
Utility
>>We may define utility in the context of testing and assessment as the usefulness or practical value of testing to
improve efficiency
■ What is the comparative utility of this test? That is, how useful is this test as compared to another test?
■ What is the clinical utility of this test? That is, how useful is it for purposes of diagnostic assessment or
treatment?
■ What is the diagnostic utility of this neurological test? That is, how useful is it for classification purposes?
..an index of reliability can tell us something about how consistently a test measures what it measures
..an index of validity can tell us something about whether a test measures what it purports to measure
..an index of utility can tell us something about the practical value of the information derived from scores on the test
Judgments regarding the utility of a test may take into account whether the benefits of testing justify the costs of
administering, scoring, and interpreting the test.
Common sense must be ever-present in the process. A psychometrically sound test of practical value is worth
paying for, even when the dollar cost is high, if the potential benefits of its use are also high or if the potential
costs of not using it are high
UTILITY ANALYSIS
>>a family of techniques that entail a cost–benefit analysis designed to yield information relevant to a decision
about the usefulness and/or practical value of a tool of assessment.
Expectancy data can provide an indication of the likelihood that a testtaker will score within some interval of
scores on a criterion measure—an interval that may be categorized as "passing," "acceptable," or "failing" (e.g., the
higher a worker's score is on this new test, the greater the probability that the worker will be judged successful)
>>utility gain refers to an estimate of the benefit (monetary or otherwise) of using a particular test or selection
method.
>>Top-down selection is a process of awarding available positions to applicants whereby the highest scorer is
awarded the first position, the next highest scorer the next position, and so forth until all positions are filled.
>> A relative cut score may be defined as a reference point—in a distribution of test scores used to divide a set
of data into two or more classifications—that is set based on norm-related considerations rather than on the
relationship of test scores to a criterion
>> A fixed cut score (absolute cut score) may be defined as a reference point—in a distribution of test
scores used to divide a set of data into two or more classifications—that is typically set with reference to a
judgment concerning a minimum level of proficiency required to be included in a particular classification
>> Multiple cut scores refers to the use of two or more cut scores with reference to one predictor for the
purpose of categorizing testtakers. (this can also entail several predictors wherein applicants must meet the requisite
cut score on every predictor to be considered for the position)
>> multiple hurdles may be thought of as one collective element of a multistage decision-making process in
which the achievement of a particular cut score on one test
is necessary in order to advance to the next stage of evaluation in the selection process. (For example, the process might begin with
the written application stage, in which individuals who turn in incomplete applications are eliminated from further
consideration. This is followed by what might be termed an additional materials stage, in which individuals with low
test scores, low GPAs, or poor letters of recommendation are eliminated.)
…Multiple hurdle selection methods assume that an individual must possess a certain minimum amount of
knowledge, skill, or ability for each attribute measured by a predictor to be successful in the desired position
>> In the compensatory model of selection, an assumption is made that high scores on one attribute can, in fact,
"balance out" or compensate for low scores on another attribute. According to this model, a person strong in
some areas and weak in others can perform as successfully in a position as a person with moderate abilities in all
areas relevant to the position in question
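A compensatory model is commonly implemented as a weighted sum, so a high score on one attribute can offset a low score on another. The sketch below is illustrative only; the attribute names, weights, and applicant scores are hypothetical.

```python
# Minimal sketch of a compensatory selection model: composite = weighted
# sum of attribute scores.

def composite_score(scores, weights):
    """Weighted sum over the attributes named in `weights`."""
    return sum(scores[attr] * w for attr, w in weights.items())

weights = {"verbal": 0.5, "quantitative": 0.3, "interview": 0.2}
uneven = {"verbal": 90, "quantitative": 40, "interview": 80}
moderate = {"verbal": 70, "quantitative": 70, "interview": 70}
# The two applicants end up with comparable composites (73.0 vs 70.0)
# despite very different profiles -- the compensatory idea in miniature.
```

Contrast this with a multiple-hurdle approach, where the uneven applicant's quantitative score of 40 might fail a per-attribute cut score and eliminate them regardless of their other strengths.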
1. The Angoff Method (William Angoff) (the key is the expert panel): an expert panel makes judgments
concerning the determination of whether or not testtakers possess a particular trait
2. The method of contrasting groups (Known Groups Method) entails collection of data on the predictor of
interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest
3. IRT-Based Methods (Item Response Theory): in this approach, cut scores are typically set based on
testtakers' performance across all the items on the test
…the item mapping method (used in setting cut scores for licensing examinations) entails the arrangement of items in a
histogram, with each column in the histogram containing items deemed to be of equivalent value
…the bookmark method (typically used in academic applications): use of this method begins with the training of
experts with regard to the minimal knowledge, skills, and/or abilities that testtakers should possess in order to
"pass." Items are arranged in an ascending order of difficulty
4. Other Methods. Discriminant analysis is a family of statistical techniques typically used to shed light on the
relationship between certain variables and two naturally occurring groups
FYI: Edward L. Thorndike. Perhaps best known for his Law of Effect. His contribution in the area of personnel
psychology came in the form of a book entitled Personnel Selection (1949). In that book he described a method of
predictive yield, a technique for identifying cut scores based on the number of positions to be filled. The method
takes into account projections regarding the likelihood of offer acceptance, the number of position openings, and
the distribution of applicant scores.
CHAPTER 8
Test Development
>> item analysis techniques are employed to assist in making judgments about which
items are good as they are, which items need to be revised, and which items should be
discarded.
TEST CONCEPTUALIZATION
Criterion-referenced testing (has the person met certain criteria: physician, engineer, piano student?) is commonly employed in
licensing contexts. It derives from a conceptualization of the knowledge or skills to be mastered
>> A pilot test is a scientific investigation of a new test's reliability and validity for its specific purpose
>> items are pilot studied to evaluate whether they should be included in the final form of the instrument
>> pilot research may involve open-ended interviews with research subjects believed for some reason to be
introverted or extraverted.
>> In pilot work, the test developer typically attempts to determine how best to measure a targeted construct (this may
entail the creation, revision, and deletion of many test items)
>> A test plan is a plan for developing a new test that specifies the characteristics of the test, including a definition of
the construct and the content to be measured (the testing universe), the format for the questions, and how the test
will be administered and scored
TEST CONSTRUCTION
Types of scales
1. age-based scale: testtaker's test performance as a function of age
2. grade-based scale: testtaker's test performance as a function of grade
3. stanine scale: all raw scores on the test are transformed into scores that can range from 1 to 9
Scaling methods
>> a rating scale can be defined as a grouping of words, statements, or symbols on which judgments of the
strength of a particular trait, attitude, or emotion are indicated by the testtaker
>> in a summative scale, the final test score is obtained by summing the ratings across all the items
…One type of summative rating scale, the Likert scale, presents the testtaker with five alternative
responses per item (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.
>> comparative
scaling entails judgments of a stimulus in comparison with every other stimulus on the scale (e.g., testtakers would be
asked to sort cards from most justifiable to least justifiable)
>> in categorical scaling, stimuli are placed into one of two or more alternative categories that differ quantitatively
with respect to some continuum (e.g., testtakers would be
asked to sort the cards into three piles: NEVER JUSTIFIED, JUSTIFIED, ALWAYS JUSTIFIED)
>> a Guttman scale yields ordinal-level measures. Items on it range sequentially from weaker to stronger
expressions of
the attitude, belief, or feeling being measured
...direct estimation: there is no need to transform the testtaker's responses into some other scale
…indirect estimation: the method of equal-appearing intervals is an example
Writing Items
Item format: variables such as the form, plan, structure, arrangement, and layout of
individual test items
Two types of item format
1. selected-response format: requires testtakers to select a response from a set of alternative responses
Three types of selected-response item formats:
a. the multiple-choice format has three elements: (1) a stem, (2) a correct alternative or option, and
(3) several incorrect alternatives or options variously referred to as distractors or foils.
An objective test format that consists of a question or partial sentence, called a stem, followed by a
number of responses, only one of which is correct
b. in a matching item, the testtaker is presented with two columns: premises on the left and
responses on the right (the testtaker's task is to determine which response is best
associated with which premise)
c. a binary-choice item (true–false item) requires the testtaker to indicate whether the statement
is or is not a fact.
2. constructed-response format: requires testtakers to supply or to create the correct answer, not merely to select it.
Three types of constructed-response items:
a. a completion item requires the examinee to provide a word or phrase that completes a sentence
b. a short-answer item is written clearly enough that the testtaker can respond succinctly (a word, a term, a
sentence, or a paragraph may qualify)
c. an essay item requires writing a composition (beyond a paragraph or two)
Writing items for computer administration: key advantages are the ability to store items in an item bank and the ability to
individualize testing through a technique called item branching
>>item branching (like a routing test): the ability of the computer to tailor the content and order of presentation
of test items on the basis of responses to previous items
(e.g., "don't present an item of the next difficulty level until two consecutive items of the current difficulty level are
answered correctly," or "terminate the test when five consecutive items of a given level of difficulty have been
answered incorrectly"); the rules are programmed into the computer in advance.
>>an item bank is a relatively large and easily accessible collection of test questions
..an advantage of an item bank is convenient access to a large number of test items
classified by subject area, item statistics, or other variables
>>computerized adaptive testing (CAT) refers to an interactive, computer-administered test-taking process wherein
items presented to the testtaker are based in part on the testtaker's performance on previous items
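The first branching rule quoted above can be sketched as a simple function that decides the next difficulty level from recent results. This is a minimal illustration; the function name and level scheme are hypothetical, not from the notes.

```python
# Minimal sketch of an item-branching rule: do not present an item of the
# next difficulty level until two consecutive items at the current level
# are answered correctly.

def next_difficulty(current_level, results_at_level, max_level):
    """results_at_level: booleans for items answered at the current
    difficulty level, most recent last."""
    if (len(results_at_level) >= 2
            and results_at_level[-1] and results_at_level[-2]):
        return min(current_level + 1, max_level)
    return current_level
```

In a full CAT system, a rule like this would run after every response, which is what makes the sequence of items adaptive to the individual testtaker.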
>>ipsative scoring: the typical objective is comparing a testtaker's score on one scale within a test to another scale
within that same test
A test scoring model that compares the test taker's scores on various scales within the inventory to yield a
profile