Generalizability Theory

Min Li,1 Richard J. Shavelson,2,3 Yue Yin,4 and Ed Wiley2
1 University of Washington, Seattle, U.S.A., 2 SK Partners, U.S.A., 3 Stanford University, U.S.A., and 4 University of Illinois, Chicago, U.S.A.

Generalizability (G) theory is a psychometric theory based on a statistical sampling approach that partitions scores into their underlying multiple sources of variation. G theory was initially introduced by Cronbach and his colleagues (Cronbach, Rajaratnam, & Gleser, 1963) as an extension of Cronbach's classic paper on internal-consistency alpha
GENERALIZABILITY THEORY 1323
A particular measurement procedure is then a combination of specific facet conditions sampled from such a universe.

In the survivor study (Gleser et al., 1978), three raters scored the interviews of 20 survivors conducted by two interviewers. Making use of two facets, interviewer and rater, this study is a two-facet design of survivor (s), interviewer (i), and rater (r), denoted s × i × r. The universe of generalization is constituted by an infinite number of interviewers and by all the raters who can quantify the interviews. A universe score for a particular survivor is then the expected value of all the admissible observations of this person from the universe of generalization. The particular measurement sampled from the universe contains two levels of the interviewer facet and three levels of the rater facet. The interviewers and the raters defined in the universe are not simply any random interviewers or raters. Instead, the universe of interviewers should be regarded as comparable to the two study interviewers concerning, for example, the type and amount of training received and the interview prompts used. The raters should be comparable to the three raters sampled in the study in terms of, for example, scoring experience, type and amount of training received, and familiarity with the measured construct.

Partitioning of Measurement Error

G theory offers a framework to conceptualize and estimate multiple sources of score variance. One critical step in applying G theory is to partition the variance of observed scores into the variance due to persons and the measurement error due to the main and interaction effects of multiple sources (facets).

In the survivor study, the survivor is the object of measurement, and measurement error may arise from inconsistencies due to the two facets (interviewer and rater). More specifically, the measurement error can be attributed to sampling errors with interviewers, raters, their interactions, and other unspecified sources. For instance, measurement error associated with interviewers might represent the fact that, over all 20 interviews, one interviewer tended to elicit information indicating systematically higher levels of impairment than did the other interviewer. G theory, supported by statistical procedures, allows simultaneous estimation of the variance due to each source, which is called a variance component.

Relative Decisions and Absolute Decisions

In G theory, the reliability coefficient is defined as the proportion of score variance attributable to persons with respect to the total variance due to both persons and the various sources of measurement error. G theory recognizes that decision makers using a measurement, such as psychologists, clinicians, researchers, parents, policy makers, and managers, may want to make two types of decisions with the scores: relative and absolute. Corresponding to these two types of decisions, different G coefficients can be computed.

Relative decisions are associated with norm-referenced interpretations of scores. These decisions concern the consistency of scores in ranking or sorting individuals according to differences in their personality, behaviors, knowledge, and/or skills. For example, a relative decision is involved when we examine whether different items measuring the ability to match familiar figures rank patients consistently on impulsivity level, or whether scores from different interviewers judged by different raters sort survivors consistently by their psychiatric impairment. In contrast, absolute decisions are associated with criterion-referenced or domain-referenced interpretations of scores. They concern indexing the exact, or absolute, level of individuals' standing on a construct for a given domain regardless of the performance of others, as in driving tests in everyday life. The measurement decisions are made based on the absolute level of scores from varying items, regardless of the relative ranking among the measured persons.
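The contrast between the two coefficient types can be sketched numerically. The variance components below are hypothetical placeholders chosen only to show which terms enter each error; they are not estimates from the survivor study.

```python
# Numeric contrast of the two coefficient types for the survivor s x i x r
# design. All variance components below are hypothetical placeholders used
# only to show which terms enter each error; they are not study estimates.
vc = {"s": 0.40,                # universe-score (survivor) variance
      "i": 0.02, "r": 0.01,     # interviewer and rater main effects
      "si": 0.10, "sr": 0.05, "ir": 0.01, "sir,e": 0.20}

# Relative error: only effects that can change survivors' relative standing.
rel_error = vc["si"] + vc["sr"] + vc["sir,e"]
# Absolute error: every component except the object of measurement.
abs_error = rel_error + vc["i"] + vc["r"] + vc["ir"]

rho2 = vc["s"] / (vc["s"] + rel_error)   # for relative decisions
phi = vc["s"] / (vc["s"] + abs_error)    # for absolute decisions
print(round(rho2, 3), round(phi, 3))     # 0.533 0.506
```

Because the absolute error adds the facet main effects to the relative error, the dependability coefficient can never exceed the generalizability coefficient.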
Generalizability Studies and Decision Studies

G theory distinguishes between generalizability (G) studies and decision (D) studies. The purpose of a G study is to estimate components of score variance associated with various sources. A D study takes these estimated variance components to evaluate and optimize among alternatives for subsequent measurement.

Let us turn again to the survivor study. Variance components can be estimated from the G study. They indicate the magnitude of estimated variance associated with a measurement procedure using only one interviewer and one rater, which is similar to a single-measure intraclass correlation coefficient. These variance components then form the basis for a series of D studies that make decisions about future measurement procedures as the sampling of some facets is changed. For instance, one might be interested in reducing the number of interviewers but increasing the number of raters to maintain the reliability of the measurement of psychic injury (n′i = 1 and n′r = 5 instead of ni = 2 and nr = 3). Variance components and reliability coefficients can be calculated to determine whether this hypothesized procedure meets the reliability requirement and to identify the minimum numbers of interviewers and raters needed to obtain dependable scores.

Designs Used in G and D Studies

There are a variety of designs that decision makers can select from when conducting their G and D studies. These designs are specified with respect to three important properties of facets, which are pertinent to how the universe is sampled and therefore to the generalization of score interpretations.

First, designs can be crossed or nested, depending on whether facets are crossed or nested with the object of measurement (person, in most cases) or with other facets. Crossed refers to designs in which all the sampled levels of any facet are present for all persons and for all levels of any other facet(s). In contrast, nested refers to the fact that only some of the sampled levels of a given facet are present for each level of the other facet(s). The survivor study is a fully crossed s × i × r design, as this measurement procedure has all the persons interviewed by the two interviewers and then scored by the three raters. But suppose that, for practical reasons, each participant is interviewed by a different set of two interviewers using the same interview protocol, while the interview responses are analyzed by the same three raters. In this case, interviewer is nested within survivor (denoted i : s, where ":" is read as "nested within") because not all the sampled levels of the interviewer facet are represented with scores for all the survivors. This particular design, denoted (i : s) × r, considers the following five sources of score variation: s; r; i : s; sr; and ir : s, e. Sometimes, the person as the object of measurement can be nested within a facet. For example, a researcher may examine the use of an observation protocol on individuals' interaction with peers in a group therapy setting where clients are nested within a group observed by two raters on five occasions. This design can be represented as (Person : Group) × Rater × Occasion.

Second, designs can be random or mixed models. Random models have only random facets, whereas mixed models have both random and fixed facets. G theory is essentially a random effects theory. Typically, a facet is random if the levels included in the studied measurement are randomly sampled from all the levels of the facet. A facet is also considered random when the intended universe of generalization for the facet is infinitely large and the levels included in the universe are exchangeable with the selected levels, even though those levels have not been randomly sampled; the interviewer and rater facets are examples. In contrast, a decision maker may consider a facet fixed when (a) the universe of generalization is rather limited or finite, so that all the levels of the facet are included in the studied measurement; or (b) the studied levels involved in the measurement are deliberately selected based on theoretical reasons, and the
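The D-study reasoning above, trading interviewers for raters while tracking reliability, can be sketched numerically. The variance components here are hypothetical placeholders; only the structure of the relative-error formula for a crossed s × i × r design follows the text.

```python
# Numeric sketch of the D-study reasoning: trade interviewers for raters
# while tracking the reliability coefficient. Variance components are
# hypothetical placeholders, not survivor-study estimates.

def rel_error(var_si, var_sr, var_sir_e, n_i, n_r):
    # Relative error variance when each survivor's score is averaged over
    # n_i interviewers and n_r raters (crossed s x i x r design).
    return var_si / n_i + var_sr / n_r + var_sir_e / (n_i * n_r)

var_s = 0.40                                  # universe-score variance
var_si, var_sr, var_sir_e = 0.10, 0.05, 0.20  # hypothetical error components

for n_i, n_r in [(2, 3), (1, 5)]:  # the original design vs. the alternative
    rho2 = var_s / (var_s + rel_error(var_si, var_sr, var_sir_e, n_i, n_r))
    print(f"n'_i={n_i}, n'_r={n_r}: rho^2={rho2:.3f}")
```

With these made-up components, dropping to a single interviewer is not fully offset by adding raters; in practice the decision rests on the components actually estimated in the G study.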
generalize from employees' performance rated by two raters to a much larger set of raters, rater is a facet. The universe would be defined by all admissible raters (e.g., all the raters who have the expertise to rate the employees).

Using the impulsivity test as an example, decision makers are typically indifferent to the particular test score on one set of items. Instead, they would be quite willing to substitute another set of items to measure the construct that they are interested in. In other words, the decision makers are more interested in an examinee's general responses to any items from the universe of generalization than in that examinee's score on any particular set of items.

A one-facet p × i design has four sources of variability. The first source is due to the systematic differences among persons, the object of measurement. It reflects individuals' personality, attitudes, behaviors, performance, knowledge, skills, and so on. The second source of variability is due to differences among test items. Some items are easy, some difficult, and some in between for all the individuals. The third source of variability is due to the interaction between individuals and items. For example, an item testing individuals' skill in addition in the context of money would be easier for a student who has received an allowance from parents than for one who has not. The fourth source of variability, which is usually confounded with the third source, may be due to randomness (e.g., lucky guesses or careless mistakes), to other systematic but unidentified or unknown sources of variability (e.g., individuals taking the test on different occasions or in different classroom settings), or to both.

Once the magnitude of each source of score variation is estimated, the measurements already made in the G study can be evaluated. In addition, D studies can be conducted to redesign measurements in order to reach a desirable reliability (e.g., lengthening the test or hiring more raters). For a one-faceted universe, two designs could be used in G and D studies to estimate variance components: the one-facet crossed design and the one-facet nested design.

One-Facet Crossed Design

In the one-facet crossed design, each person is given the same random sample of items. The design is denoted as p × i, where p refers to person (or examinee), i refers to item, and × means "crossed with." The statistical model used to estimate variance components in the one-facet crossed design is described next.

Statistical model. The observed score for any person on any item is denoted as Xpi. The universe score, denoted as μp, is defined as a person's average score over the entire item universe, Ei(Xpi), where the symbol E is an expectation operator and the subscript i designates that the expectation is taken over items. Similarly, μi is the population mean for item i, defined as Ep(Xpi), the expected value of Xpi over persons. μ is the grand mean over both the population and the universe, EpEi(Xpi).

An observed score for one person on one item (Xpi) can be decomposed as follows:

Xpi = μ  [grand mean]
    + (μp − μ)  [person effect]
    + (μi − μ)  [item effect]
    + (Xpi − μp − μi + μ)  [person × item interaction confounded with residual]   (1)

Each effect other than the grand mean in Equation 1 has a distribution with a mean of zero and its own variance σ2, called the variance component. For example, the variance component for the person effect is defined as σ2p = Ep(μp − μ)2. That is, it is the average of the squared deviations of the persons' universe scores from the grand mean. The variance component for items can be defined similarly as σ2i. The final effect, the p × i interaction confounded with the residual, also has a mean of zero and a variance denoted as σ2pi,e. This effect includes the p × i interaction effect, which reflects the fact that not all people find the same
items easy or difficult, confounded with unsystematic variability and unspecified variability from hidden facets not explicitly included or controlled in the one-facet G study, because with only one observation per cell in the person × item data matrix the interaction cannot be disentangled from "within-cell" variability.

In sum, the variance of the collection of observed scores, Xpi, over all persons and items in the universe is expressed as the sum of the three variance components:

σ2Xpi = σ2p + σ2i + σ2pi,e   (2)

That is, the variance of item scores can be partitioned into independent sources of variation due to differences between persons as objects of measurement, σ2p, and two sources of measurement error: variation due to items, σ2i, and variation due to the p × i interaction confounded with the residual, σ2pi,e.

In the one-facet crossed design, the generalizability coefficient (ρ2), calculated as σ2p ∕ (σ2p + σ2Rel.), pertains to relative decisions. Specifically, the variance component σ2pi,e contributes to relative measurement error because the interaction between person and item influences the relative standing of individuals. In contrast, the variance component due to the item effect, σ2i, does not contribute to relative measurement error because variance in item difficulty, which influences everybody, does not influence the ranking of individuals. The dependability coefficient (Φ) is calculated as σ2p ∕ (σ2p + σ2Abs.) for absolute decisions. Here, all variance components except the one for the object of measurement contribute to the absolute measurement error, that is, σ2i and σ2pi,e in this one-facet crossed design. Usually, the generalizability coefficient is larger than or equal to the dependability coefficient.

Numerical example for the G study. Let's use a specific example to illustrate how to estimate variance components for the one-facet crossed design. Suppose that eight items were given to 10 individuals, and the data are displayed in Table 1. Items are regarded as a random sample from the infinite universe of all items, and persons are randomly sampled from the population of individuals.

Analysis of variance (ANOVA) can be used to estimate the variance components. Several statistical packages are available for conducting G and/or D studies based on the ANOVA approach: (a) GENOVA and urGENOVA (along with mGENOVA), authored by Brennan as a family of programs; (b) the Variance Components procedure in SPSS; and (c) Proc Varcomp in SAS. The GENOVA suite of computer programs can run G and D studies for a variety of designs when facets have different properties and missing values may be present.

Table 1  Item scores on eight items of 10 examinees (modified from Brennan, 2001, p. 28).

Person       Item 1     2     3     4     5     6     7     8   Person mean
1                 1     0     1     0     0     0     0     0   0.250
2                 1     1     1     0     0     1     0     0   0.500
3                 1     1     1     1     1     0     0     0   0.625
4                 1     1     0     1     1     0     0     1   0.625
5                 1     1     1     1     1     0     1     0   0.750
6                 1     1     1     0     1     1     1     0   0.750
7                 1     1     1     1     1     1     1     0   0.875
8                 1     1     1     1     0     1     1     1   0.875
9                 1     1     1     1     1     1     1     1   1.000
10                1     1     1     1     1     1     1     1   1.000
Item mean         1   0.9   0.9   0.7   0.7   0.6   0.6   0.4   0.725
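Equation 2's additive structure can be checked by simulation: generate scores from the Equation 1 model with known variance components and compare the observed-score variance with the sum of the components. This is an illustrative sketch; the component values and the Gaussian effect distributions are my choices, not anything estimated from Table 1.

```python
# Simulate Equation 1 with known variance components and check that the
# observed-score variance approximates the Equation 2 sum. All numbers here
# are arbitrary illustrative choices.
import random

random.seed(0)
mu = 5.0
true = {"p": 1.0, "i": 0.5, "pi,e": 0.25}   # sigma^2 for each effect
n_p, n_i = 2000, 50

person_eff = [random.gauss(0, true["p"] ** 0.5) for _ in range(n_p)]
item_eff = [random.gauss(0, true["i"] ** 0.5) for _ in range(n_i)]
scores = [[mu + pe + ie + random.gauss(0, true["pi,e"] ** 0.5)
           for ie in item_eff] for pe in person_eff]

flat = [x for row in scores for x in row]
mean = sum(flat) / len(flat)
obs_var = sum((x - mean) ** 2 for x in flat) / len(flat)
print(f"observed {obs_var:.2f} vs. sum of components {sum(true.values()):.2f}")
```

With large samples of persons and items, the observed variance converges on σ2p + σ2i + σ2pi,e, which is what Equation 2 asserts.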
In contrast, the SPSS or SAS programs can only produce results for G studies; users then have to use the variance components estimated by these two programs to calculate parameters for D studies.

A two-way ANOVA, with score as the dependent variable and person and item as independent variables, is conducted to calculate mean squares (MSs) for persons, items, and the residual. The MSs can be expressed as expected mean squares (EMSs), which are weighted sums of variance components, as follows (for derivations of the EMS equations, see Kirk, 1982):

EMSp = σ2pi,e + ni σ2p
EMSi = σ2pi,e + np σ2i
EMSpi,e = σ2pi,e

These equations can be used to estimate each variance component, after each EMS is replaced with the corresponding observed MS and the exact value of σ2 is replaced with the estimate σ̂2. Solving the equations for each variance component gives:

σ̂2pi,e = MSpi,e
σ̂2p = (MSp − σ̂2pi,e) ∕ ni = (MSp − MSpi,e) ∕ ni
σ̂2i = (MSi − σ̂2pi,e) ∕ np = (MSi − MSpi,e) ∕ np

Table 2 presents the ANOVA results and the variance components estimated by using the formulas in this subsection.

Table 2  ANOVA estimates of variance components for the one-facet crossed design data in Table 1.

Based on the estimated variance components, ρ2 for relative decisions is calculated as σ2p ∕ (σ2p + σ2pi,e) = 0.0365 ∕ (0.0365 + 0.1470) = 0.199. Φ for absolute decisions is calculated as σ2p ∕ (σ2p + σ2i + σ2pi,e) = 0.0365 ∕ (0.0365 + 0.0246 + 0.1470) = 0.175.

D studies. It is important to notice that the variance components estimated in Table 2 reflect the variance of a one-item test; therefore, the generalizability coefficients calculated are associated with scores from the one-item test. It is not surprising that the coefficients are rather low. If multiple items (n′i) are given to the examinees, the magnitudes of the variance components and coefficients will improve. D studies can be conducted to estimate how the magnitudes of the variance components and the corresponding generalizability coefficients vary with the number of items.

It is well known from statistics that the variance of mean scores is the variance of individual scores divided by the number of scores from which the mean is calculated. Therefore, the variance components associated with n′i items, when the score interpretation is based on the mean of scores from n′i items, can be estimated as follows:

σ2i′ = σ2i ∕ n′i
σ2pi,e′ = σ2pi,e ∕ n′i

In the example discussed earlier, if 10 items rather than one item are used, both σ2i and σ2pi,e will become one tenth of their original values; that is, σ2i′ = 0.00246 and σ2pi,e′ = 0.0147. Correspondingly,

ρ2 = σ2p ∕ (σ2p + σ2pi,e ∕ n′i)
   = 0.0365 ∕ (0.0365 + 0.147 ∕ 10)
   = 0.713
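The G study and D study calculations above can be reproduced directly from the Table 1 data. The sketch below implements the two-way ANOVA mean squares and the EMS solutions from the text in plain Python; variable names are mine. (The entry's Φ = 0.175 reflects rounding the residual mean square to 0.147 before use.)

```python
# Reproduce the one-facet crossed (p x i) G study from the Table 1 data,
# using the EMS equations given in the text.
scores = [
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
n_p, n_i = len(scores), len(scores[0])
grand = sum(map(sum, scores)) / (n_p * n_i)
p_means = [sum(row) / n_i for row in scores]
i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

# Two-way ANOVA sums of squares and mean squares
ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_pie = (ss_tot - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))

# Solve the EMS equations for the variance components
var_pie = ms_pie                  # sigma^2(pi,e)
var_p = (ms_p - ms_pie) / n_i     # sigma^2(p)
var_i = (ms_i - ms_pie) / n_p     # sigma^2(i)
print(round(var_p, 4), round(var_i, 4), round(var_pie, 4))  # 0.0365 0.0246 0.1468

# G-study coefficients for a one-item test, then the 10-item D study
rho2 = var_p / (var_p + var_pie)
phi = var_p / (var_p + var_i + var_pie)
rho2_10 = var_p / (var_p + var_pie / 10)
print(round(rho2, 3), round(phi, 3), round(rho2_10, 3))  # 0.199 0.176 0.713
```

The estimates match Table 2 and the coefficients computed in the text, up to the rounding noted above.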
One-Facet Nested Design

In a one-facet nested design, each person is given a different random set of items. For instance, in the example of item as the nested facet with 10 students, the eight items given to each person are all different. That is, items 1 to 8 are given to person 1, items 9 to 16 are given to person 2, and so on, until items 73 to 80 are given to person 10. The design is denoted as i : p. The ANOVA procedure can then be used to calculate the MSs and then estimate the variance components.

Numerical example. To use space efficiently, this section continues to use the data in Table 1, but now imagine that different persons take different samples of items. That is, let us assume that Items 1 to 8 are different sets for different persons. Table 3 presents the ANOVA results and the estimated variance components.

Table 3  ANOVA estimates of variance for the data in Table 1 (assuming that the data are from a one-facet nested design instead of a one-facet crossed design).

Once the variance components are estimated, the ρ2 and Φ coefficients can be calculated. Unlike in a one-facet crossed design, the errors for relative decisions and absolute decisions are the same because the error variance associated with the item main effect cannot be distinguished from the other sources of error. Accordingly, ρ2 = Φ = 0.0335 ∕ (0.0335 + 0.1710) = 0.164.

D studies. Similar to D studies for a one-facet crossed design, when the number of items increases, the error variance decreases and the ρ2 and Φ coefficients increase. D studies allow decision makers to estimate how the magnitudes of the coefficients change with the number of items and to determine the number of items required to reach certain coefficients.

Because the variance of the person effect (σ2p) is the variance of the universe score, which will not change, the D study only needs to recalculate the variance component for the nested effect confounded with the residual when using n′i items. The formula is σ2i:p,e′ = σ2i:p,e ∕ n′i.

Suppose that 10 items are given to examinees. The coefficients for relative and absolute decisions can be calculated as ρ2 = Φ = 0.0335 ∕ (0.0335 + 0.1710 ∕ 10) = 0.662. Similar to the solution in the one-facet crossed design, to reach a coefficient of 0.80 the following equation needs to be solved: ρ2 = Φ = 0.0335 ∕ (0.0335 + 0.1710 ∕ n′i) = 0.80. n′i is found to be 20.4; that is, at least 21 items are needed to reach coefficients of 0.80.

Multifacet Designs

This section uses designs with two facets as multifacet examples to describe the crossed and nested models used in G and D studies. Because the difference between crossed and nested designs and the statistical models have already been described in a previous section of this entry, this section focuses on the interpretation of variance components involving multiple facets, without getting into the technical details or, due to space constraints, describing each of the two designs extensively. Instead, a G study example of a crossed design will be introduced, and then both the crossed design and the nested design used in D studies will be described. In the descriptions, the basic concepts of G theory will be reviewed to articulate and stress the essence of applying G theory to measurement issues.

A multifacet universe is defined by more than one facet, which may be crossed or nested with the object of measurement (person) and other facet(s), and may be random or fixed. If a psychologist plans to generalize from clients' responses to a set of questionnaire items on one occasion to a larger set of items on a much larger set of occasions, item and occasion are the two facets involved. The universe would be defined by the infinite number of items on all admissible occasions. If a researcher wants to generalize from toddlers' interaction with their peers rated by two raters on eight indicators to a much larger set of raters on a much larger set of indicators, then both rater and indicator are facets that should be considered. The universe would be defined by all admissible raters (e.g., all the raters who have the expertise to rate the children) on all admissible indicators of the construct of interest. As described in these examples, studies that involve two or more facets are called multifacet designs.

Using the item and rater facets as a concrete example, there may potentially be a variety of designs that the psychologist can consider for her study. Table 4 lists some examples. Depending on the structure of the facets (e.g., how facets are crossed or nested), the variability of observed scores is partitioned into different main effects and interaction effects, besides the person effect as the object of measurement (in most cases). The mathematical calculations for these effects differ as well, but this is not the focus of this entry (for additional explanation, see Brennan, 2001).

Among all the possible designs that a researcher can select, the fully crossed design (p × i × r) is most advantageous for the G study, because other nested designs, or a crossed design with different sampling of items and
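The nested-design D-study arithmetic above, the 0.662 coefficient and the solve for n′i, can be sketched as follows, using the estimates quoted in the text.

```python
# Nested-design (i:p) D study with the estimated components from the text:
# sigma^2_p = 0.0335 and sigma^2_(i:p,e) = 0.1710. Find the smallest number
# of items giving rho^2 = phi >= 0.80.
var_p, var_ipe = 0.0335, 0.1710

def coefficient(n_items):
    # In the i:p design, relative and absolute error coincide.
    return var_p / (var_p + var_ipe / n_items)

print(round(coefficient(10), 3))  # 0.662, matching the text

n = 1
while coefficient(n) < 0.80:
    n += 1
print(n)  # smallest n'_i reaching 0.80 -> 21
```

Solving the closed-form equation gives n′i = 20.4, so the integer search stops at 21, in agreement with the text.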
raters, can be examined in D studies based on the variance components estimated in the G study. However, the utility of nested designs in D studies can be rather limited. For example, the design of either i : (p × r) or (i × r) : p can be considered in the D studies when a G study employs the design p × i × r, but not vice versa, because an effect confounded with other effects in the G study cannot be decomposed subsequently in D studies. Therefore, if resources allow, researchers should consider using the fully crossed design in G studies, as this design offers more flexibility for the designs in D studies. Given the constraints imposed by resources, researchers then need to specify the needed resources in terms of items constructed, raters to be trained, and other logistics (e.g., recruiting raters or item developers). Assume that the measurement procedure to be studied implies 10 items and two raters. For example, the design p × (i : r) indicates that five items will be scored by Rater 1 and another five items by Rater 2, whereas the design (i × p) : r indicates that each person will take a measurement with a different set of 10 items, but all the responses will be scored by two raters. Therefore, the selection of the study design is largely determined by the intended generalizations in the D studies, as well as by the resources available to conduct the data collection.

Similar to the one-facet design, the generalizability and dependability coefficients can be calculated using the relative measurement error and the absolute measurement error, respectively. The D studies involve a decision-making process that considers the pattern of variance components associated with multiple facets, evaluates the options available by manipulating the properties of facets, and decides on the optimum measurement procedure that elegantly balances the technical requirements and the logistical costs. In what follows, two designs, p × i × r for the multifacet crossed design and p × (i : r) as an example of a multifacet nested design, are described to illustrate the reasoning for partitioning the score variability, estimating the generalizability and dependability coefficients, and conducting D studies.

A G Study Example Using the Multifacet Crossed Design

In the multifacet crossed design, each person is given the same random sample of items scored by the same sample of raters. The design is denoted as p × i × r, where p refers to person (or examinee) as the object of measurement and i and r refer to the item and rater facets, respectively. The statistical model used to estimate variance components in this crossed design is described below.

Statistical model. The observed score for any person on any item given by any rater is denoted as Xpir. This observed score in the two-facet crossed design can be decomposed
Item:         1           2           3       …      10
Rater:     1  2  3     1  2  3     1  2  3         1  2  3
Person
1          4  4  4     1  1  2     2  2  3         3  4  4
2          2  3  3     2  2  2     1  2  1         2  2  2
3          4  4  4     1  1  1     1  2  2         3  3  4
…
N = 50     3  4  3     1  2  1     2  2  1         4  4  4
GENERALIZABILITY THEORY 1335
the score variance, which is a rather negligible source of measurement error. This shows that, although some items tend to elicit slightly higher scores for all individuals than other items do, this unevenness in item difficulty contributes little to the overall score variation once the other sources of variation are taken into consideration. Finally, the remaining components (i.e., σ²r, σ²pr, and σ²ir), all of which involve the rater facet, appear to contribute little to the score variability. This provides evidence that the training of raters and the use of scoring systems minimize the potential measurement error from the rater main effect and the related interaction effects.

After pinpointing the relative and absolute measurement errors, the generalizability coefficient (ρ²) and the dependability coefficient (Φ) can be calculated for the case in which only one item and one rater are sampled from the universe of admissible observations:

ρ² = σ²p / (σ²p + σ²Rel.)
   = σ²p / (σ²p + σ²pi + σ²pr + σ²pir,e)
   = 0.1884 / (0.1884 + 0.2132 + 0.0028 + 0.2589)
   = 0.284

Φ = σ²p / (σ²p + σ²Abs.)
  = σ²p / (σ²p + σ²i + σ²r + σ²pi + σ²pr + σ²ir + σ²pir,e)
  = 0.1884 / (0.1884 + 0.0111 + 0 + 0.2132 + 0.0028 + 0.0014 + 0.2589)
  = 0.279

Examples of D Studies Using the Crossed and Nested Designs

Based on the estimated variance components in Table 6, more items and raters are apparently needed to increase the generalizability or dependability of score inferences. Assume that the D studies still use the crossed design with both facets as random. Table 7 reports the estimated variance components, measurement errors, and generalizability and dependability coefficients when the number of items is 5, 10, or 15 and the number of raters is 2, 3, or 4. As more items and more raters are included in a measurement procedure, measurement error shrinks and the coefficients rise.

Alternatively, the results of D studies can be presented graphically. For instance, Figure 1 shows the dependability coefficients for given numbers of items (n′i ranging from 2 to 14) and raters (n′r ranging from 2 to 4). Table 7 and Figure 1 suggest that the benefit of adding items to reduce the error variance starts to diminish at approximately nine items, and that the benefit of adding raters reaches a plateau at three raters.

The results of D studies also make it evident that the development, specification, and evaluation of new measurement procedures can be handled systematically within the G theory approach. For example, to maintain a dependability coefficient of 0.80, a measurement procedure using eight items and two raters, or seven items and three raters, can be considered. Depending on the financial and human resources, a decision maker can
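The single-item, single-rater coefficients derived above (ρ² = 0.284, Φ = 0.279) can be reproduced in a few lines. A minimal Python sketch (illustrative only, using the variance-component estimates reported in the article):

```python
# Generalizability (rho^2) and dependability (Phi) coefficients for the
# p x i x r design with a single item and a single rater, computed from
# the estimated variance components quoted in the text.
vc = {"p": 0.1884, "i": 0.0111, "r": 0.0,
      "pi": 0.2132, "pr": 0.0028, "ir": 0.0014, "pir,e": 0.2589}

rel_error = vc["pi"] + vc["pr"] + vc["pir,e"]          # relative error
abs_error = rel_error + vc["i"] + vc["r"] + vc["ir"]   # absolute error

rho2 = vc["p"] / (vc["p"] + rel_error)
phi = vc["p"] / (vc["p"] + abs_error)
print(round(rho2, 3), round(phi, 3))  # 0.284 0.279
```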
Table 7  Estimated variance components (EVCs) for the p × i × r random effects design, based on results of the G study in Table 6.

                                               n′r = 2              n′r = 3              n′r = 4
Source        EVC     D-study EVC       n′i =  5     10    15       5     10    15       5     10    15
Person (p)    0.1884  σ²p                   0.188 0.188 0.188   0.188 0.188 0.188   0.188 0.188 0.188
Item (i)      0.0111  σ²i/n′i               0.002 0.001 0.001   0.002 0.001 0.001   0.002 0.001 0.001
Rater (r)     0.0000  σ²r/n′r               0.000 0.000 0.000   0.000 0.000 0.000   0.000 0.000 0.000
p × i         0.2132  σ²pi/n′i              0.043 0.021 0.014   0.043 0.021 0.014   0.043 0.021 0.014
p × r         0.0028  σ²pr/n′r              0.001 0.001 0.001   0.001 0.001 0.001   0.001 0.001 0.001
i × r         0.0014  σ²ir/(n′i·n′r)        0.000 0.000 0.000   0.000 0.000 0.000   0.000 0.000 0.000
p × i × r, e  0.2589  σ²pir,e/(n′i·n′r)     0.026 0.013 0.009   0.017 0.009 0.006   0.013 0.006 0.004
Relative measurement error (σ²Rel.)         0.070 0.035 0.024   0.061 0.031 0.021   0.057 0.028 0.019
Absolute measurement error (σ²Abs.)         0.072 0.036 0.025   0.063 0.032 0.022   0.059 0.030 0.020
Generalizability coefficient (ρ²)           0.729 0.843 0.887   0.755 0.858 0.900   0.767 0.870 0.908
Dependability coefficient (Φ)               0.723 0.839 0.883   0.749 0.855 0.895   0.761 0.866 0.904
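The D-study columns of Table 7 follow from dividing each G-study variance component by the corresponding D-study sample sizes. A small Python sketch of this calculation (illustrative, not the article's code):

```python
# D-study sketch for the crossed p x i x r design: scale each G-study
# variance component by the D-study sample sizes n'_i and n'_r, then
# recompute the generalizability and dependability coefficients.
vc = {"p": 0.1884, "i": 0.0111, "r": 0.0,
      "pi": 0.2132, "pr": 0.0028, "ir": 0.0014, "pir,e": 0.2589}

def d_study(ni, nr):
    rel = vc["pi"] / ni + vc["pr"] / nr + vc["pir,e"] / (ni * nr)
    abs_ = rel + vc["i"] / ni + vc["r"] / nr + vc["ir"] / (ni * nr)
    rho2 = vc["p"] / (vc["p"] + rel)
    phi = vc["p"] / (vc["p"] + abs_)
    return round(rho2, 3), round(phi, 3)

print(d_study(5, 2))  # (0.729, 0.723), the first column of Table 7
```

Calling `d_study(8, 2)` gives Φ ≈ 0.805, consistent with the recommendation below that eight items and two raters suffice to keep the dependability coefficient at 0.80. (The cells of Table 7 were evidently computed from rounded components, so recomputed values can differ from the table by one unit in the third decimal.)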
[Figure 1  Dependability coefficient (Φ) as a function of the number of items (2 to 14), plotted separately for n′r = 2, 3, and 4 raters; the vertical axis ranges from 0.55 to 0.90.]
compare these measurement procedures and select the more efficient one that still meets the psychometric requirements.

As explained earlier, a decision maker may choose a nested design when planning the possible measurement procedures in the D studies, given logistic considerations. In this case, each person does not have scores from all the levels of the facets. There are many variations of nested designs, depending on the relationship between the object of measurement and the facets, for example p × (i : r) or i : r : p. Each requires a different procedure for compiling test items and arranging the scoring. In what follows, these two nested designs are used to illustrate how D studies are conducted and how variance components are interpreted.

Again, 10 items and three raters are the measurement conditions sampled from the universe. In the p × (i : r) design, each individual would be administered the same set of 30 items, among which 10 items are scored by Rater 1, another 10 items by Rater 2, and the last set of 10 items by Rater 3. Of the two random facets, item is a nested facet and rater is a crossed facet. In the i : r : p design, both facets are nested: each individual is tested with a different set of 30 items and scored by a different group of three raters, with each rater grading only 10 items. Based on the results of the crossed G study in Table 6, Table 8 presents the estimated variance components for a series of D studies using the two nested designs. As in the D studies for one-facet designs, the error variance can be reduced, and the coefficients ρ² and Φ increased, when more levels of the item and rater facets are sampled from the universe.

Depending on the magnitudes of the variance components reported in the G studies, the patterns of estimated variance components for different designs can differ dramatically or can be similar, as shown in the two examples in Table 8. Considering that facets can be fixed or random in D studies, decision makers can propose even more variations of complicated
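The nested-design EVCs in Table 8 can be obtained directly from the crossed G-study components: effects that a nested design cannot disentangle are pooled. A short Python sketch of this mapping (illustrative; the pooling rules shown are the standard ones for these two designs):

```python
# Pool the crossed p x i x r components into the confounded components of
# the two nested D-study designs (G-study values as reported in the text).
vc = {"p": 0.1884, "i": 0.0111, "r": 0.0,
      "pi": 0.2132, "pr": 0.0028, "ir": 0.0014, "pir,e": 0.2589}

# p x (i : r): the item main effect is confounded with the item-rater
# interaction, and the person-item interaction with the residual.
p_ir = {"p": vc["p"],
        "r": vc["r"],
        "i:r": vc["i"] + vc["ir"],           # 0.0125 -> 0.012 in Table 8
        "pr": vc["pr"],
        "pi:r,e": vc["pi"] + vc["pir,e"]}    # 0.4721 -> 0.472

# i : r : p: only person, rater-within-person, and one residual remain.
irp = {"p": vc["p"],
       "r:p": vc["r"] + vc["pr"],            # 0.0028 -> 0.003
       "i:r:p,e": vc["i"] + vc["ir"] + vc["pi"] + vc["pir,e"]}  # 0.4846 -> 0.484

print(p_ir, irp)
```

Dividing these pooled components by n′r or n′i × n′r, exactly as in the crossed case, reproduces the remaining columns of Table 8.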
Table 8  Estimated variance components (EVCs) for the p × (i : r) and i : r : p random effects designs, based on results of the p × i × r G study in Table 6.

p × (i : r) design
                                               n′r = 2              n′r = 3              n′r = 4
Source        EVC     D-study EVC       n′i =  5     10    15       5     10    15       5     10    15
Person (p)    0.188   σ²p                   0.188 0.188 0.188   0.188 0.188 0.188   0.188 0.188 0.188
Rater (r)     0.000   σ²r/n′r               0.000 0.000 0.000   0.000 0.000 0.000   0.000 0.000 0.000
Item (i : r)  0.012   σ²i:r/(n′i·n′r)       0.001 0.001 0.000   0.001 0.000 0.000   0.001 0.000 0.000
p × r         0.003   σ²pr/n′r              0.002 0.002 0.002   0.001 0.001 0.001   0.001 0.001 0.001
pi : r, e     0.472   σ²pi:r,e/(n′i·n′r)    0.047 0.024 0.016   0.031 0.016 0.010   0.024 0.012 0.008
Generalizability coefficient (ρ²)           0.793 0.879 0.913   0.855 0.917 0.945   0.883 0.935 0.954
Dependability coefficient (Φ)               0.790 0.874 0.913   0.851 0.917 0.945   0.879 0.935 0.954

i : r : p design
Person (p)    0.188   σ²p                   0.188 0.188 0.188   0.188 0.188 0.188   0.188 0.188 0.188
r : p         0.003   σ²r:p/n′r             0.002 0.002 0.002   0.001 0.001 0.001   0.001 0.001 0.001
i : r : p, e  0.484   σ²i:r:p,e/(n′i·n′r)   0.048 0.024 0.016   0.032 0.016 0.011   0.024 0.012 0.008
Generalizability coefficient (ρ²)           0.790 0.879 0.913   0.851 0.917 0.940   0.883 0.935 0.954
Dependability coefficient (Φ)               0.790 0.879 0.913   0.851 0.917 0.940   0.883 0.935 0.954
1992), resampling methods such as the bootstrap and jackknife (e.g., Brennan, Harris, & Hanson, 1987; Wiley, 2000), and Bayesian analysis (e.g., Brennan, 2001). Generally, the latter two groups of methods are less sensitive to assumptions about the score distribution, especially for dichotomous scores.

When conducting G studies with the ANOVA method, researchers may encounter negative values for the estimated variance components, mostly in multifacet designs. Conceptually, variance cannot be negative; therefore, negative values often signal problems due to sampling error or model misspecification (Shavelson & Webb, 1981); in other words, the levels of certain facet(s) are not comparable. Two strategies are often used to address negative estimates when the values are small: (a) Cronbach's algorithm, which substitutes zero for the negative estimate and uses zero wherever that estimate is needed in calculating other variance components (Cronbach et al., 1972); and (b) Brennan's algorithm, which retains the negative values when calculating all the variance components and only at the end sets the negative estimated variance components to zero (Brennan, 2001). Brennan's is the preferred approach. That said, statistical methods other than ANOVA, such as Bayesian analysis and maximum likelihood, can be used to avoid the problem of negative estimates.

Standard Error Estimates

One critical concept in interpreting the results of G and D studies is the estimated standard error. Like the standard error of measurement in classical reliability theory, standard error estimates based on the relative or absolute measurement error can inform researchers about the confidence interval for individuals' universe scores. Results of G and D studies can be interpreted meaningfully by comparing the interval range with the score scale. Furthermore, the standard errors of the variance components should be taken into account when the variance components are examined. Large standard errors indicate instability of the variance-component estimates; that is, the estimates are not robust, owing to the small sample of subjects. In particular, when the absolute value of a negative estimate is relatively large compared with its standard error, that variance component should not simply be treated as zero. Finally, G theory can be applied to estimate individual-level standard errors by acknowledging, as Brennan and his colleagues point out (Brennan & Lee, 1999; Lee, Brennan, & Kolen, 2000), that the standard error of measurement of raw scores or scale scores is conditional on the level of the measured construct; for example, some individuals are more inconsistent than others in responding to items when they frequently guess the responses. This adds complexity to estimating and interpreting the standard error of measurement.

SEE ALSO: Construct Validity; Reliability

References

Brennan, R. L. (2000). (Mis)conceptions about generalizability theory. Educational Measurement: Issues and Practice, 19(1), 5–10.

Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.

Brennan, R. L., Harris, D. J., & Hanson, B. A. (1987). The bootstrap and other procedures for examining the variability of estimated variance components in testing contexts (ACT Research Report Series 87-7). Iowa City, IA: American College Testing Program.

Brennan, R. L., & Lee, W.-C. (1999). Conditional scale-score standard errors of measurement under binomial and compound binomial assumptions. Educational and Psychological Measurement, 59, 5–24.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis