
Validity/Reliability

Reliability

 
From the perspective of classical test
theory, an examinee's obtained test
score (X) is composed of two
components, a true score component
(T) and an error component (E):
 
X = T + E
 
Reliability
The true score component reflects the
examinee's status with regard to the
attribute that is measured by the test,
while the error component represents
measurement error.

Measurement error is random error. It is due to factors that are irrelevant to what is being measured by the test and that have an unpredictable (unsystematic) effect on an examinee's test score.
Reliability
The score you obtain on a test is likely to
be due both to the knowledge you have
about the topics addressed by exam items
(T) and the effects of random factors (E)
such as the way test items are written, any
alterations in anxiety, attention, or
motivation you experience while taking the
test, and the accuracy of your "educated
guesses."
 
Reliability
Whenever we administer a test to
examinees, we would like to know how
much of their scores reflects "truth" and
how much reflects error. It is a measure
of reliability that provides us with an
estimate of the proportion of variability in
examinees' obtained scores that is due to
true differences among examinees on the
attribute(s) measured by the test.
Reliability

When a test is reliable, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym for reliability (e.g., Anastasi, 1988).

Consistency = Reliability
The Reliability Coefficient

Ideally, a test's reliability would be calculated by dividing true score variance by the obtained (total) variance to derive a reliability index. This index would indicate the proportion of observed variability in test scores that reflects true score variability.

True Score Variance / Total Variance = Reliability Index
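This relationship can be illustrated with a short simulation (a minimal sketch in Python; the score distributions and variances below are hypothetical, not taken from the text). It generates true scores and random error, then shows that the ratio of true score variance to obtained score variance recovers the reliability index:

import numpy as np

rng = np.random.default_rng(0)
n_examinees = 10_000

true_scores = rng.normal(loc=50, scale=8, size=n_examinees)   # T (variance = 64)
error = rng.normal(loc=0, scale=4, size=n_examinees)          # E, random error (variance = 16)
obtained = true_scores + error                                # X = T + E

reliability_index = true_scores.var() / obtained.var()
print(round(reliability_index, 2))   # approximately 64 / (64 + 16) = .80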
The Reliability Coefficient
A test's true score variance is not known, however,
and reliability must be estimated rather than
calculated directly.

There are several ways to estimate a test's reliability. Each involves assessing the consistency of an examinee's scores over time, across different content samples, or across different scorers.

The common assumption for each of these reliability techniques is that consistent variability is true score variability, while variability that is inconsistent reflects random error.
The Reliability Coefficient

Most methods for estimating reliability produce a reliability coefficient, which is a correlation coefficient that ranges in value from 0.0 to +1.0. When a test's reliability coefficient is 0.0, this means that all variability in obtained test scores is due to measurement error. Conversely, when a test's reliability coefficient is +1.0, this indicates that all variability in scores reflects true score variability.
The Reliability Coefficient
The reliability coefficient is symbolized with the letter "r" and a subscript that contains two of the same letters or numbers (e.g., "rxx").

The subscript indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure.
The Reliability Coefficient
Regardless of the method used to calculate a reliability
coefficient, the coefficient is interpreted directly as the
proportion of variability in obtained test scores that reflects
true score variability. For example, as depicted in Figure 1, a
reliability coefficient of .84 indicates that 84% of variability in
scores is due to true score differences among examinees, while
the remaining 16% (1.00 - .84) is due to measurement error.

 Figure 1. Proportion of variability in test scores

True Score Variability (84%) Error (16%)


The Reliability Coefficient

Note that a reliability coefficient does not provide any information about what is actually being measured by a test!

A reliability coefficient only indicates whether the attribute measured by the test—whatever it is—is being assessed in a consistent, precise way.

Whether the test is actually assessing what it was designed to measure is addressed by an analysis of the test's validity.
The Reliability Coefficient

Study Tip: Remember that, in contrast to other correlation coefficients, the reliability coefficient is never squared to interpret it but is interpreted directly as a measure of true score variability. A reliability coefficient of .89 means that 89% of variability in obtained scores is true score variability.
Methods for Estimating
Reliability
The selection of a method for estimating
reliability depends on the nature of the
test.

Each method not only entails different procedures but is also affected by different sources of error. For many tests, more than one method should be used.
1. Test-Retest Reliability:

The test-retest method for estimating reliability involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores. When using this method, the reliability coefficient indicates the degree of stability (consistency) of examinees' scores over time and is also known as the coefficient of stability.
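As a minimal sketch (the two score vectors are hypothetical), the coefficient of stability is simply the Pearson correlation between the scores from the two administrations:

import numpy as np

# Scores for the same five examinees on two occasions (hypothetical data)
time_1 = np.array([23, 31, 28, 35, 27])
time_2 = np.array([25, 30, 29, 36, 26])

# Coefficient of stability = correlation between the two sets of scores
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 2))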
Test-Retest Reliability
The primary sources of measurement error for
test-retest reliability are any random factors
related to the time that passes between the two
administrations of the test.

These time sampling factors include random fluctuations in examinees over time (e.g., changes in anxiety or motivation) and random variations in the testing situation.

Memory and practice also contribute to error when they have random carryover effects; i.e., when they affect many or all examinees but not in the same way.
Test-Retest Reliability
Test-retest reliability is appropriate for
determining the reliability of tests designed to
measure attributes that are relatively stable over
time and that are not affected by repeated
measurement.

It would be appropriate for a test of aptitude, which is a stable characteristic, but not for a test of mood, since mood fluctuates over time, or a test of creativity, which might be affected by previous exposure to test items.
2. Alternate (Equivalent,
Parallel) Forms Reliability:
To assess a test's alternate forms reliability,
two equivalent forms of the test are
administered to the same group of
examinees and the two sets of scores are
correlated.

Alternate forms reliability indicates the consistency of responding to different item samples (the two test forms) and, when the forms are administered at different times, the consistency of responding over time.
Alternate (Equivalent,
Parallel) Forms Reliability
 The alternate forms reliability
coefficient is also called the coefficient
of equivalence when the two forms are
administered at about the same time;

 and the coefficient of equivalence and stability when a relatively long period of time separates administration of the two forms.
Alternate (Equivalent, Parallel)
Forms Reliability

The primary source of measurement error for alternate forms reliability is content sampling, or error introduced by an interaction between different examinees' knowledge and the different content assessed by the items included in the two forms (e.g., Form A and Form B).
Alternate (Equivalent,
Parallel) Forms Reliability
The items in Form A might be a better match of
one examinee's knowledge than items in Form B,
while the opposite is true for another examinee.

In this situation, the two scores obtained by each examinee will differ, which will lower the alternate forms reliability coefficient.

When administration of the two forms is separated by a period of time, time sampling factors also contribute to error.
Alternate (Equivalent, Parallel)
Forms Reliability

Like test-retest reliability, alternate forms reliability is not appropriate when the attribute measured by the test is likely to fluctuate over time (and the forms will be administered at different times) or when scores are likely to be affected by repeated measurement.
Alternate (Equivalent,
Parallel) Forms Reliability
 If the same strategies required to solve
problems on Form A are used to solve problems
on Form B, even if the problems on the two
forms are not identical, there are likely to be
practice effects.

 When these effects differ for different examinees (i.e., are random), practice will serve as a source of measurement error.

 Although alternate forms reliability is considered by some experts to be the most rigorous (and best) method for estimating reliability, it is not often assessed due to the difficulty in developing forms that are truly equivalent.
3. Internal Consistency
Reliability:
Reliability can also be estimated by measuring the
internal consistency of a test.

Split-half reliability and coefficient alpha are two methods for evaluating internal consistency. Both involve administering the test once to a single group of examinees, and both yield a reliability coefficient that is also known as the coefficient of internal consistency.
 
Internal Consistency
Reliability
To determine a test's split-half reliability,
the test is split into equal halves so that
each examinee has two scores (one for
each half of the test).

Scores on the two halves are then correlated. Tests can be split in several ways, but probably the most common way is to divide the test on the basis of odd- versus even-numbered items.
Internal Consistency Reliability
A problem with the split-half method is that it produces a
reliability coefficient that is based on test scores that were
derived from one-half of the entire length of the test.

If a test contains 30 items, each score is based on 15 items. Because reliability tends to decrease as the length of a test decreases, the split-half reliability coefficient usually underestimates a test's true reliability.

For this reason, the split-half reliability coefficient is ordinarily corrected using the Spearman-Brown prophecy formula, which provides an estimate of what the reliability coefficient would have been had it been based on the full length of the test.
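The procedure can be sketched in a few lines of Python (the item-response matrix below is hypothetical): the test is split into odd- and even-numbered items, the half scores are correlated, and the Spearman-Brown formula projects that correlation to the full test length.

import numpy as np

# Hypothetical responses: rows = examinees, columns = items (1 = correct, 0 = incorrect)
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimated reliability of the full-length test
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))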
 
Internal Consistency
Reliability
Cronbach's coefficient alpha also involves
administering the test once to a single group of
examinees. However, rather than splitting the test in
half, a special formula is used to determine the
average degree of inter-item consistency.

One way to interpret coefficient alpha is as the average reliability that would be obtained from all possible splits of the test. Coefficient alpha tends to be conservative and can be considered the lower boundary of a test's reliability (Novick and Lewis, 1967).

When test items are scored dichotomously (right or wrong), a variation of coefficient alpha known as the Kuder-Richardson Formula 20 (KR-20) can be used.
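A minimal sketch of coefficient alpha computed from an item-score matrix (the data are hypothetical; because the items here are scored 0/1, the same calculation also serves as an illustration of KR-20):

import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array with rows = examinees, columns = items."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical dichotomous responses (1 = correct, 0 = incorrect)
items = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])
print(round(cronbach_alpha(items), 2))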
Internal Consistency
Reliability
Content sampling is a source of error for both split-
half reliability and coefficient alpha.

 For split-half reliability, content sampling refers to the error resulting from differences between the content of the two halves of the test (i.e., the items included in one half may better fit the knowledge of some examinees than items in the other half);

 for coefficient alpha, content (item) sampling refers to differences between individual test items rather than between test halves.
Internal Consistency
Reliability
Coefficient alpha also has the heterogeneity of the content domain as a source of error.

A test is heterogeneous with regard to content domain when its items measure several different domains of knowledge or behavior.
Internal Consistency
Reliability
The greater the heterogeneity of the content
domain, the lower the inter-item correlations and
the lower the magnitude of coefficient alpha.

Coefficient alpha could be expected to be smaller for a 200-item test that contains items assessing knowledge of test construction, statistics, ethics, epidemiology, environmental health, social and behavioral sciences, rehabilitation counseling, etc., than for a 200-item test that contains questions on test construction only.
Internal Consistency
Reliability
The methods for assessing internal consistency
reliability are useful when a test is designed to
measure a single characteristic, when the
characteristic measured by the test fluctuates over
time, or when scores are likely to be affected by
repeated exposure to the test.

They are not appropriate for assessing the reliability of speed tests because, for these tests, they tend to produce spuriously high coefficients. (For speed tests, alternate forms reliability is usually the best choice.)
4. Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability:
Inter-rater reliability is of concern whenever
test scores depend on a rater's judgment.

A test constructor would want to make sure that an essay test, a behavioral observation scale, or a projective personality test has adequate inter-rater reliability. This type of reliability is assessed either by calculating a correlation coefficient (e.g., a kappa coefficient or coefficient of concordance) or by determining the percent agreement between two or more raters.
Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
Although the latter technique is frequently used, it
can lead to erroneous conclusions since it does not
take into account the level of agreement that
would have occurred by chance alone.

This is a particular problem for behavioral observation scales that require raters to record the frequency of a specific behavior.

In this situation, the degree of chance agreement is high whenever the behavior has a high rate of occurrence, and percent agreement will provide an inflated estimate of the measure's reliability.
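The inflation problem can be seen in a short sketch comparing percent agreement with Cohen's kappa, a chance-corrected agreement index (the rating vectors below are hypothetical):

import numpy as np

# Two raters record whether a high-frequency behavior occurred in 10 intervals
rater_a = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
rater_b = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

# Percent agreement ignores agreement expected by chance
p_observed = np.mean(rater_a == rater_b)

# Cohen's kappa corrects for chance agreement
p_chance = (np.mean(rater_a == 1) * np.mean(rater_b == 1) +
            np.mean(rater_a == 0) * np.mean(rater_b == 0))
kappa = (p_observed - p_chance) / (1 - p_chance)

print(round(p_observed, 2), round(kappa, 2))   # percent agreement is .80, but kappa is slightly negative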
Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
Sources of error for inter-rater reliability include factors related to the raters, such as lack of motivation and rater biases, as well as characteristics of the measuring device.

An inter-rater reliability coefficient is likely to be low, for instance, when rating categories are not exhaustive (i.e., don't include all possible responses or behaviors) and/or are not mutually exclusive.
Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
The inter-rater reliability of a behavioral rating scale
can also be affected by consensual observer drift,
which occurs when two (or more) observers working
together influence each other's ratings so that they
both assign ratings in a similarly idiosyncratic way.

(Observer drift can also affect a single observer's ratings when he or she assigns ratings in a consistently deviant way.) Unlike other sources of error, consensual observer drift tends to artificially inflate inter-rater reliability.
Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
The reliability (and validity) of ratings can be
improved in several ways:
 Consensual observer drift can be eliminated by
having raters work independently or by alternating
raters.
 Rating accuracy is also improved when raters are
told that their ratings will be checked.
 Overall, the best way to improve both inter- and
intra-rater accuracy is to provide raters with training
that emphasizes the distinction between observation
and interpretation (Aiken, 1985).
RELIABILITY AND VALIDITY

Study Tip: Remember that the Spearman-Brown formula is related to split-half reliability and that KR-20 is related to coefficient alpha. Also know that alternate forms reliability is the most thorough method for estimating reliability and that internal consistency reliability is not appropriate for speed tests.
Factors That Affect The
Reliability Coefficient
The magnitude of the reliability coefficient
is affected not only by the sources of error
discussed earlier, but also by the length of
the test, the range of the test scores, and
the probability that the correct response to
items can be selected by guessing.

 Test Length
 Range of Test Scores
 Guessing
1. Test Length:
The larger the sample of the attribute being
measured by a test, the less the relative
effects of measurement error and the more
likely the sample will provide dependable,
consistent information.

Consequently, a general rule is that the longer the test, the larger the test's reliability coefficient.
 
Test Length
The Spearman-Brown prophecy formula is most
associated with split-half reliability but can actually
be used whenever a test developer wants to
estimate the effects of lengthening or shortening a
test on its reliability coefficient.

For instance, if a 100-item test has a reliability coefficient of .84, the Spearman-Brown formula could be used to estimate the effects of increasing the number of items to 150 or reducing the number to 50.

A problem with the Spearman-Brown formula is that it does not always yield an accurate estimate of reliability: In general, it tends to overestimate a test's true reliability (Gay, 1992).
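The general form of the Spearman-Brown prophecy formula can be sketched as follows; the numbers reproduce the 100-item example above, where n is the factor by which the test length is multiplied:

def spearman_brown(r_original, n):
    """Predicted reliability when the test length is multiplied by a factor of n."""
    return (n * r_original) / (1 + (n - 1) * r_original)

r_100_items = 0.84
print(round(spearman_brown(r_100_items, 150 / 100), 2))   # lengthened to 150 items -> about .89
print(round(spearman_brown(r_100_items, 50 / 100), 2))    # shortened to 50 items -> about .72
# With n = 2, the formula reduces to the split-half correction, 2r / (1 + r).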
Test Length
This is most likely to be the case when the added
items do not measure the same content domain as
the original items and/or are more susceptible to
the effects of measurement error.

Note that, when used to correct the split-half reliability coefficient, the situation is more complex, and this generalization does not always apply: When the two halves are not equivalent in terms of their means and standard deviations, the Spearman-Brown formula may either over- or underestimate the test's actual reliability.
2. Range of Test Scores:
Since the reliability coefficient is a
correlation coefficient, it is
maximized when the range of
scores is unrestricted.

The range is directly affected by the degree of similarity of examinees with regard to the attribute measured by the test.
Range of Test Scores
When examinees are heterogeneous, the range of
scores is maximized.

The range is also affected by the difficulty level of the test items.

When all items are either very difficult or very easy, all examinees will obtain either low or high scores, resulting in a restricted range.

Therefore, the best strategy is to choose items so that the average difficulty level is in the mid-range (p = .50).
3. Guessing:
A test's reliability coefficient is also affected by the
probability that examinees can guess the correct
answers to test items.

As the probability of correctly guessing answers increases, the reliability coefficient decreases.

All other things being equal, a true/false test will have a lower reliability coefficient than a four-alternative multiple-choice test which, in turn, will have a lower reliability coefficient than a free recall test.
The Interpretation of
Reliability
The interpretation of a test's
reliability entails considering
its effects on the scores
achieved by a group of
examinees as well as the score
obtained by a single examinee.
 
Interpretation of
Reliability Coefficient
The Reliability Coefficient: As discussed previously, a
reliability coefficient is interpreted directly as the
proportion of variability in a set of test scores that is
attributable to true score variability.

A reliability coefficient of .84 indicates that 84% of variability in test scores is due to true score differences among examinees, while the remaining 16% is due to measurement error.

While different types of tests can be expected to have different levels of reliability, for most tests in the social sciences, reliability coefficients of .80 or larger are considered acceptable.
The Interpretation of
Reliability
When interpreting a reliability coefficient, it is
important to keep in mind that there is no
single index of reliability for a given test.

Instead, a test's reliability coefficient can vary from situation to situation and sample to sample. Ability tests, for example, typically have different reliability coefficients for groups of individuals of different ages or ability levels.
Interpretation of Standard
Error of Measurement
While the reliability coefficient is useful for
estimating the proportion of true score variability
in a set of test scores, it is not particularly helpful
for interpreting an individual examinee's obtained
test score.

When an examinee receives a score of 80 on a 100-item test that has a reliability coefficient of .84, for instance, we can only conclude that, since the test is not perfectly reliable, the examinee's obtained score might or might not be his or her true score.
 
Interpretation of Standard
Error of Measurement
A common practice when interpreting an
examinee’s obtained score is to construct a
confidence interval around that score.

The confidence interval helps a test user estimate the range within which an examinee's true score is likely to fall given his or her obtained score.

This range is calculated using the standard error of measurement, which is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test. (When raw scores have been converted to percentile ranks, the confidence interval is referred to as a percentile band.)
Interpretation of Standard
Error of Measurement
The following formula is used to estimate the
standard error of measurement:
 
Formula 1: Standard Error of Measurement

SEmeas = SDx * (1 – rxx)^(1/2)

Where:
SEmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient
Interpretation of Standard
Error of Measurement
 As shown by the formula, the magnitude of
the standard error is affected by two factors:

 the standard deviation of the test scores (SDx), and

 the test's reliability coefficient (rxx).

The lower the test's standard deviation and the higher its reliability coefficient, the smaller the standard error of measurement (and vice versa).
Interpretation of Standard
Error of Measurement
Because the standard error is a type of standard
deviation, it can be interpreted in terms of the
areas under the normal curve.

With regard to confidence intervals, this means that a 68% confidence interval is constructed by adding and subtracting one standard error to an examinee's obtained score; a 95% confidence interval is constructed by adding and subtracting two standard errors; and a 99% confidence interval is constructed by adding and subtracting three standard errors.
Interpretation of Standard
Error of Measurement
Example: A psychologist administers an interpersonal assertiveness test to a sales applicant who receives a score of 80. Since the test's reliability is less than 1.0, the psychologist knows that this score might be an imprecise estimate of the applicant's true score and decides to use the standard error of measurement to construct a 95% confidence interval. Assuming that the test's reliability coefficient is .84 and its standard deviation is 10, the standard error of measurement is equal to 4.0:

SEmeas = SDx * (1 – rxx)^(1/2) = 10 * (1 – .84)^(1/2) = 10 * (.4) = 4.0

The psychologist constructs the 95% confidence interval by adding and subtracting two standard errors from the applicant's obtained score: 80 ± 2(4.0) = 72 to 88. This means that there is a 95% chance that the applicant's true score falls somewhere between 72 and 88.
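The same computation in Python (this sketch simply applies Formula 1 and the two-standard-error rule used in the example):

import math

def standard_error_of_measurement(sd_x, r_xx):
    return sd_x * math.sqrt(1 - r_xx)

sem = standard_error_of_measurement(sd_x=10, r_xx=0.84)       # 4.0

obtained_score = 80
ci_95 = (obtained_score - 2 * sem, obtained_score + 2 * sem)  # (72.0, 88.0)
print(sem, ci_95)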
Interpretation of Standard
Error of Measurement
One problem with the standard error is that
measurement error is not usually equally distributed
throughout the range of test scores.

Use of the same standard error to construct confidence intervals for all scores in a distribution can, therefore, be somewhat misleading.

To overcome this problem, some test manuals report different standard errors for different score intervals.
Estimating True Scores from
Obtained Scores
As discussed earlier, because of the effects
of measurement error, obtained test scores
tend to be biased (inaccurate) estimates of
true scores.

More specifically, scores above the mean of a distribution tend to overestimate true scores, while scores below the mean tend to underestimate true scores.
Estimating True Scores
from Obtained Scores
Moreover, the farther from the mean an
obtained score is, the greater this bias.

Rather than constructing a confidence interval, an alternative (but less used) method for interpreting an examinee's obtained test score is to estimate his/her true score using a formula that takes into account this bias by adjusting the obtained score using the mean of the distribution and the test's reliability coefficient.
Estimating True Scores from
Obtained Scores
For example, if an examinee obtains a score of 80 on a test that has a mean of 70 and a reliability coefficient of .84, the formula predicts that the examinee's true score is 78.4:

T' = (1 – rxx) * M + rxx * X

where M is the mean of the distribution and X is the obtained score:

T' = (1 – .84) * 70 + .84 * 80
   = 11.2 + 67.2 = 78.4
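A sketch of the estimated-true-score calculation used above, with the values from the example:

def estimated_true_score(obtained, mean, r_xx):
    """T' = (1 - r_xx) * mean + r_xx * obtained."""
    return (1 - r_xx) * mean + r_xx * obtained

print(estimated_true_score(obtained=80, mean=70, r_xx=0.84))   # 78.4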
The Reliability of Difference
Scores
A test user is sometimes interested in
comparing the performance of an
examinee on two different tests or
subtests and, therefore, computes a
difference score. An educational
psychologist, for instance, might
calculate the difference between a
child's WISC-IV Verbal and
Performance scores.
The Reliability of Difference
Scores
When doing so, it is important to keep in mind that the
reliability coefficient for the difference scores can be no
larger than the average of the reliabilities of the two
tests or subtests:

If Test A has a reliability coefficient of .95 and Test B has a reliability coefficient of .85, this means that difference scores calculated from the two tests will have a reliability coefficient of .90 or less.

The exact size of the reliability coefficient for difference scores depends on the degree of correlation between the two tests: The more highly correlated the tests, the smaller the reliability coefficient (and the larger the standard error of measurement).
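One standard psychometric formula for the reliability of difference scores captures both points; it is not given explicitly in the text and assumes the two tests have equal variances, so treat this sketch as illustrative:

def difference_score_reliability(r_aa, r_bb, r_ab):
    """((r_aa + r_bb) / 2 - r_ab) / (1 - r_ab), assuming equal test variances."""
    return ((r_aa + r_bb) / 2 - r_ab) / (1 - r_ab)

# Test A reliability = .95, Test B reliability = .85 (the example above)
print(round(difference_score_reliability(0.95, 0.85, 0.00), 2))   # .90 when the tests are uncorrelated
print(round(difference_score_reliability(0.95, 0.85, 0.60), 2))   # .75 when the tests correlate .60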
Validity
Validity refers to a test's accuracy. A test is valid when it
measures what it is intended to measure. The intended uses for
most tests fall into one of three categories, and each category is
associated with a different method for establishing validity:
 
 The test is used to obtain information about an examinee's
familiarity with a particular content or behavior domain: content
validity.
 
 The test is administered to determine the extent to which an
examinee possesses a particular hypothetical trait: construct
validity.
 
 The test is used to estimate or predict an examinee's standing or
performance on an external criterion: criterion-related validity.
Validity
For some tests, it is necessary to demonstrate only one
type of validity; for others, it is desirable to establish more
than one type.

For example, if an arithmetic achievement test will be used to assess the classroom learning of 8th grade students, establishing the test's content validity would be sufficient. If the same test will be used to predict the performance of 8th grade students in an advanced high school math class, the test's content and criterion-related validity will both be of concern.
 
Note that, even when a test is found valid for a particular
purpose, it might not be valid for that purpose for all
people. It is quite possible for a test to be a valid measure
of intelligence or a valid predictor of job performance for
one group of people but not for another group.
Content Validity
A test has content validity to the extent that it adequately
samples the content or behavior domain that it is
designed to measure.

If test items are not a good sample, results of testing will be misleading.

Although content validation is sometimes used to establish the validity of personality, aptitude, and attitude tests, it is most associated with achievement-type tests that measure knowledge of one or more content domains and with tests designed to assess a well-defined behavior domain.

Adequate content validity would be important for a statistics test and for a work (job) sample test.
Content Validity
Content validity is usually "built into" a test as it is
constructed through a systematic, logical, and
qualitative process that involves clearly identifying
the content or behavior domain to be sampled and
then writing or selecting items that represent that
domain.

Once a test has been developed, the establishment of content validity relies primarily on the judgment of subject matter experts.

If experts agree that test items are an adequate and representative sample of the target domain, then the test is said to have content validity.
Content Validity
Although content validation depends mainly on the
judgment of experts, supplemental quantitative
evidence can be obtained.

If a test has adequate content validity:
 a coefficient of internal consistency will be large;
 the test will correlate highly with other tests that purport to measure the same domain; and
 pre-/post-test evaluations of a program designed to increase familiarity with the domain will indicate appropriate changes.
 
Content Validity
Don't confuse content validity with face validity.

Content validity refers to the systematic evaluation of a test by experts who determine whether or not test items adequately sample the relevant domain, while face validity refers simply to whether or not a test "looks like" it measures what it is intended to measure.

Although face validity is not an actual type of validity, it is a desirable feature for many tests. If a test lacks face validity, examinees may not be motivated to respond to items in an honest or accurate manner. A high degree of face validity does not, however, indicate that a test has content validity.
Construct Validity
  When a test has been found to measure the hypothetical
trait (construct) it is intended to measure, the test is said to
have construct validity. A construct is an abstract
characteristic that cannot be observed directly but must be
inferred by observing its effects. Intelligence, mechanical
aptitude, self-esteem, and neuroticism are all constructs.
 
There is no single way to establish a test's construct
validity. Instead, construct validation entails a systematic
accumulation of evidence showing that the test actually
measures the construct it was designed to measure. The
various methods used to establish this type of validity each
answer a slightly different question about the construct and
include the following:
Construct Validity
 Assessing the test's internal consistency: Do scores on
individual test items correlate highly with the total test
score; i.e., are all of the test items measuring the same
construct?
 
 Studying group differences: Do scores on the test
accurately distinguish between people who are known to
have different levels of the construct?
 
 Conducting research to test hypotheses about the
construct: Do test scores change, following an experimental
manipulation, in the direction predicted by the theory
underlying the construct?
 
Construct Validity

 Assessing the test's convergent and discriminant validity: Does the test have high correlations with measures of the same trait (convergent validity) and low correlations with measures of unrelated traits (discriminant validity)?
 
 Assessing the test's factorial validity: Does the
test have the factorial composition it would be
expected to have; i.e., does it have factorial
validity?
Construct Validity
Construct validity is said to be the most
theory-laden of the methods of test
validation.

The developer of a test designed to measure a construct begins with a theory about the nature of the construct, which then guides the test developer in selecting test items and in choosing the methods for establishing the test's validity.
Construct Validity
For example, if the developer of a creativity test
believes that creativity is unrelated to general
intelligence, that creativity is an innate
characteristic that cannot be learned, and that
creative people can be expected to generate more
alternative solutions to certain types of problems
than non-creative people, she would want to
determine the correlation between scores on the
creativity test and a measure of intelligence, to see
if a course in creativity affects test scores, and find
out if test scores distinguish between people who
differ in the number of solutions they generate to
relevant problems.
Construct Validity
Note that some experts consider construct validity to be the most basic form of validity because the techniques involved in establishing construct validity overlap those used to determine if a test has content or criterion-related validity.

Indeed, Cronbach argues that "all validation is one, and in a sense all is construct validation."
Construct Validity
Convergent and Discriminant Validity:

As mentioned earlier, one way to assess a test's construct validity is to correlate test scores with scores on measures that do and do not purport to assess the same trait.

High correlations with measures of the same trait provide evidence of the test's convergent validity, while low correlations with measures of unrelated characteristics provide evidence of the test's discriminant (divergent) validity.
Construct Validity
The multitrait-multimethod matrix (Campbell & Fiske,
1959) is used to systematically organize the data collected
when assessing a test's convergent and discriminant
validity.

The multitrait-multimethod matrix is a table of correlation coefficients, and, as its name suggests, it provides information about the degree of association between two or more traits that have each been assessed using two or more methods.

When the correlations between different methods measuring the same trait are larger than the correlations between the same and different methods measuring different traits, the matrix provides evidence of the test's convergent and discriminant validity.
Multitrait-multimethod
matrix
Example: To assess the construct validity of the interpersonal assertiveness test, a psychologist administers four measures to a group of salespeople: (1) the test of interpersonal assertiveness; (2) a supervisor's rating of interpersonal assertiveness; (3) a test of aggressiveness; and (4) a supervisor's rating of aggressiveness.

The psychologist has the minimum data needed to construct a multitrait-multimethod matrix: She has measured two traits that she believes are unrelated (assertiveness and aggressiveness), and each trait has been measured by two different methods (a test and a supervisor's rating). The psychologist calculates correlation coefficients for all possible pairs of scores on the four measures and constructs the following multitrait-multimethod matrix (the upper half of the table has not been filled in because it would simply duplicate the correlations in the lower half):
Multitrait-multimethod
matrix
                              A1       B1       A2       B2
A1  Assertiveness Test       (.93)
B1  Aggressiveness Test       .13     (.91)
A2  Assertiveness Rating      .71      .09     (.86)
B2  Aggressiveness Rating     .04      .68      .16     (.89)


Multitrait-multimethod
matrix
All multitrait-multimethod matrices contain
four types of correlation coefficients:
 
 Monotrait-monomethod coefficients ("same
trait-same method")
 Monotrait-heteromethod coefficients ("same
trait-different methods")
 Heterotrait-monomethod coefficients
("different traits-same method")
 Heterotrait-heteromethod coefficients
("different traits-different methods“)
Monotrait-monomethod
coefficients
(or the "same trait-same method")

The monotrait-monomethod coefficients (the coefficients shown in parentheses on the diagonal of the previous matrix) are reliability coefficients:

They indicate the correlation between a measure and itself.

Although these coefficients are not directly relevant to a test's convergent and discriminant validity, they should be large in order for the matrix to provide useful information.
Monotrait-heteromethod
coefficients

(or "same trait-different methods"):

These coefficients (.71 and .68 in the example matrix) indicate the correlation between different measures of the same trait.

When these coefficients are large, they provide evidence of convergent validity.
Heterotrait-monomethod
coefficients
(or "different traits-same method"):

These coefficients (.13 and .16 in the example matrix) show the correlation between different traits that have been measured by the same method.

When the heterotrait-monomethod coefficients are small, this indicates that a test has discriminant validity.
Heterotrait-heteromethod
coefficients
(or "different traits-different methods"):

The heterotrait-heteromethod coefficients (.09 and .04 in the example matrix) indicate the correlation between different traits that have been measured by different methods.

These coefficients also provide evidence of discriminant validity when they are small.
Multitrait-multimethod
matrix

Note that, in a multitrait-multimethod matrix, only those correlation coefficients that include the test that is being validated are actually of interest.

In our example matrix, the correlation between the rating of interpersonal assertiveness and the rating of aggressiveness (r = .16) is a heterotrait-monomethod coefficient, but it isn't of interest because it doesn't provide information about the interpersonal assertiveness test.
Multitrait-multimethod
matrix
Also, the number of correlation
coefficients that can provide evidence of
convergent and discriminant validity
depends on the number of measures
included in the matrix.

In the example, only four measures were included (the minimum number), but there could certainly have been more.
Multitrait-multimethod
matrix
Example: Three of the correlations in our
multitrait-multimethod matrix are relevant to the
construct validity of the interpersonal
assertiveness test.

The correlation between the assertiveness test and the assertiveness rating (monotrait-heteromethod coefficient) is .71. Since this is a relatively high correlation, it suggests that the test has convergent validity.
Multitrait-multimethod
matrix
The correlation between the assertiveness test and the
aggressiveness test (heterotrait-monomethod coefficient)
is .13 and the correlation between the assertiveness test
and the aggressiveness rating (heterotrait-heteromethod
coefficient) is .04.

Because these two correlations are low, they confirm that the assertiveness test has discriminant validity. This pattern of correlation coefficients confirms that the assertiveness test has construct validity.

Note that the monotrait-monomethod coefficient for the assertiveness test is .93, which indicates that the test also has adequate reliability. (The other correlations in the matrix are not relevant to the psychologist's validation study because they do not include the assertiveness test.)
Multitrait-multimethod
matrix
                              A1       B1       A2       B2
A1  Assertiveness Test       (.93)
B1  Aggressiveness Test       .13     (.91)
A2  Assertiveness Rating      .71      .09     (.86)
B2  Aggressiveness Rating     .04      .68      .16     (.89)


Construct Validity

Factor Analysis: Factor analysis is used for several reasons, including identifying the minimum number of common factors required to account for the intercorrelations among a set of tests or test items, evaluating a test's internal consistency, and assessing a test's construct validity.

When factor analysis is used for the latter purpose, a test is considered to have construct (factorial) validity when it correlates highly only with the factor(s) that it would be expected to correlate with.
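As an illustrative sketch only (the data are simulated, and scikit-learn's FactorAnalysis is just one convenient tool, not the specific method described in the text), a factor analysis of six items built from two underlying factors shows the kind of loading pattern that supports factorial validity:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500

# Simulate six items: the first three driven by one factor, the last three by another
factor_1 = rng.normal(size=n)
factor_2 = rng.normal(size=n)
items = np.column_stack([factor_1 + rng.normal(scale=0.5, size=n) for _ in range(3)] +
                        [factor_2 + rng.normal(scale=0.5, size=n) for _ in range(3)])

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))   # items 1-3 load on one factor, items 4-6 on the other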
Reliability & Validity
Which is plausible?

1. An instrument that is reliable, but not valid?

OR

2. An instrument that is valid, but not reliable?

Why is this the case?
