
Generalizability Theory

Min Li,¹ Richard J. Shavelson,²,³ Yue Yin,⁴ and Ed Wiley²
¹University of Washington, Seattle, U.S.A.; ²SK Partners, U.S.A.; ³Stanford University, U.S.A.; ⁴University of Illinois, Chicago, U.S.A.

Generalizability (G) theory is a psychometric theory based on a statistical sampling approach that partitions scores into their underlying multiple sources of variation. G theory was initially introduced by Cronbach and his colleagues (Cronbach, Rajaratnam, & Gleser, 1963) as an extension of Cronbach's classic paper on internal-consistency alpha
reliability (Cronbach's alpha; Cronbach, 1951) and was further developed by others (Brennan, 2000, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1981, 1991; Webb, Shavelson, & Haertel, 2006). It addresses the consistency or dependability of score interpretations by recognizing and estimating the magnitude of measurement error from multiple sources.

This entry describes the theoretical foundation and applications of G theory for an audience of clinical psychologists, not psychometricians. An example is introduced to ground a discussion of the basic concepts of G theory, an overview of the theory, and its applications. One-facet and multifacet designs are then described, within which all the basic concepts introduced are presented along with examples. The entry concludes with a summary of important topics around the use of G theory and calls attention to additional resources.

Conceptual Underpinnings of G Theory

G Theory as a Sampling Approach

When using a psychological or behavioral instrument to measure phenomena of interest (e.g., bullying, anxiety, racial identity, belonging, or learning disability), clinical psychologists typically focus on obtaining a stable, consistent estimate of a patient's level on that scale, for diagnosis of severity and treatment. The core of G theory lies in the notion that an instrument can include only a small sample of items or tasks from the much larger universe of items or tasks that could have been included instead. Our ultimate goal in administering a test is to gain enough information from this limited sample to generalize across the larger universe. Our success in generalizing from limited sample to broad universe depends on the nature and magnitude of the error encountered because we can sample only a small part of the targeted universe. G theory provides an approach for pinpointing the magnitudes of measurement error that make the generalization from sample to universe hazardous, and it provides an estimate of the consistency of such generalization by examining the extent to which scores vary due to persons and to various sources of measurement error.

Four sets of interrelated concepts are critical for understanding G theory: (a) the universe defined by facet(s), (b) the partitioning of measurement error, (c) relative and absolute decisions, and (d) generalizability studies and decision studies. In what follows, these concepts are introduced by situating them in a study of measures of disaster survivors' psychological impairment (Gleser, Green, & Winget, 1978). Twenty adult survivors of a dam disaster were interviewed independently by two interviewers. Each interview was then independently scored by three raters for the extent of long-term psychiatric impairment on three subdimensions (i.e., anxiety, depression–isolation, and hostility–belligerence) and one overall scale, based on the Psychiatric Evaluation Form with a scale from 1 (none) to 6 (extreme).

Universe Defined by Facet(s)

G theory addresses the fact that scores assigned to persons vary not only due to differences in the personality, behaviors, symptoms, abilities, or skills measured, but also due to various sources of measurement error associated with "facets" of the measurement (e.g., interviewers and scorers in the example). The variance attributable to person differences is assumed to be "wanted" or "expected" score variation: a person's universe score. A universe score is defined as the expected value of all the observations on a person across the universe of possible options; it is equivalent to a true score in classical test theory. The universe, or universe of generalization, is then defined by specifying the admissible facets of a measurement that do not change the construct of interest, and the levels of each facet that can be considered exchangeable or parallel. Facet refers to each of the characteristics (factors) used to generate or define the universe of observations, such as test item, survey form, rater, and occasion. Each facet has a set of varying yet comparable categories, called levels or conditions.
A particular measurement procedure is then a combination of specific facet conditions sampled from such a universe.

In the survivor study (Gleser et al., 1978), three raters scored the interviews of 20 survivors conducted by two interviewers. Making use of two facets, interviewer and rater, this study is a two-facet design of survivor (s), interviewer (i), and rater (r), denoted s × i × r. The universe of generalization is constituted by an infinite number of interviewers and by all the raters who could quantify the interviews. A universe score for a particular survivor is then the expected value of all the admissible observations of this person from the universe of generalization. The particular measurement sampled from the universe contains two levels of the interviewer facet and three levels of the rater facet. The interviewers and raters defined in the universe are not simply any random interviewers or raters. Instead, the universe of interviewers should be regarded as comparable to the two study interviewers concerning, for example, the type and amount of training received and the interview prompts used. The raters should be comparable to the three raters sampled in the study in terms of, for example, scoring experience, type and amount of training received, and familiarity with the measured construct.

Partitioning of Measurement Error

G theory offers a framework to conceptualize and estimate multiple sources of score variance. One critical step in applying G theory is to partition the variance of observed scores into the variance due to persons and the measurement error due to the main and interaction effects of multiple sources (facets).

In the survivor study, the survivor is the object of measurement, and measurement error may arise from inconsistencies due to the two facets (interviewer and rater). More specifically, the measurement error can be attributed to sampling errors associated with interviewers, raters, their interactions, and other unspecified sources. For instance, measurement error associated with interviewers might represent the fact that, over all 20 interviews, one interviewer tended to elicit information indicating systematically higher levels of impairment than did the other interviewer. G theory, supported by statistical procedures, allows simultaneous estimation of the variance due to each source, which is called a variance component.

Relative Decisions and Absolute Decisions

In G theory, the reliability coefficient is defined as the proportion of score variance attributable to persons with respect to the total variance due to both persons and the various sources of measurement error. G theory recognizes that decision makers using a measurement, such as psychologists, clinicians, researchers, parents, policy makers, and managers, may want to make two types of decisions with the scores: relative and absolute. Corresponding to these two types of decisions, different G coefficients can be computed.

Relative decisions are associated with norm-referenced interpretations of scores. These decisions concern the consistency of scores in ranking or sorting individuals according to differences in their personality, behaviors, knowledge, and/or skills. For example, a relative decision is involved when we examine whether different items measuring the ability to match familiar figures rank patients consistently on impulsivity, or whether scores from different interviewers judged by different raters sort survivors consistently on their psychiatric impairment. In contrast, absolute decisions are associated with criterion-referenced or domain-referenced interpretations of scores. They concern indexing an individual's exact or absolute level on the construct for a given domain regardless of the performance of others, as in everyday driving tests. The measurement decisions are made based on the absolute level of scores from varying items regardless of the relative ranking among the measured persons.
Generalizability Studies and Decision Studies

G theory distinguishes between generalizability (G) studies and decision (D) studies. The purpose of a G study is to estimate the components of score variance associated with various sources. A D study takes these estimated variance components to evaluate and optimize among alternatives for subsequent measurement.

Let us turn again to the survivor study. Variance components can be estimated from the G study. They indicate the magnitude of estimated variance associated with a measurement procedure using only one interviewer and one rater, which is similar to a single-measure intraclass correlation coefficient. These variance-component magnitudes then form the basis for a series of D studies to make decisions about future measurement procedures as the sampling of some facets is changed. For instance, one might be interested in reducing the number of interviewers but increasing the number of raters to maintain the reliability of the measurement of psychic injury (n′i = 1 and n′r = 5 instead of ni = 2 and nr = 3). Variance components and reliability coefficients can be calculated to determine whether this hypothesized procedure meets the reliability requirement and to identify the minimum numbers of interviewers and raters needed to obtain dependable scores.

Designs Used in G and D Studies

There is a variety of designs that decision makers can select from when conducting their G and D studies. These designs are specified with respect to three important properties of facets. These properties are pertinent to how the universe is sampled and therefore to the generalization of score interpretations.

First, designs can be crossed or nested, depending on whether facets are crossed or nested with the object of measurement (person, in most cases) or with other facets. Crossed refers to designs in which all the sampled levels of any facet are present for all persons and for all levels of any other facet(s). In contrast, nested refers to the fact that only some of the sampled levels of a given facet are present for each level of the other facet(s). The survivor study is a fully crossed s × i × r design because this measurement procedure has all the persons interviewed by the two interviewers, with every interview then scored by the three raters. But suppose that, for practical reasons, each participant is interviewed by a different set of two interviewers using the same interview protocol, while the interview responses are still analyzed by the same three raters. In this case, interviewer is nested within survivor (denoted i : s, where ":" is read as "nested within") because not all the sampled levels of the interviewer facet are represented with scores for all the survivors. This particular design, denoted (i : s) × r, considers the following five sources of score variation: s; r; i : s; sr; and ir : s, e. Sometimes the person, as the object of measurement, can be nested within a facet. For example, a researcher may examine the use of an observation protocol for individuals' interaction with peers in a group therapy setting where clients are nested within groups observed by two raters on five occasions. This design can be represented as (Person : Group) × Rater × Occasion.

Second, designs can be random or mixed models. Random models have only random facets, whereas mixed models have both random and fixed facets. G theory is essentially a random effects theory. Typically, a random facet is generated if the levels of the studied measurement are randomly sampled from all the levels of a facet. A facet is also considered random when the intended universe of generalization for this facet is infinitely large and the levels included in the universe are exchangeable with the selected levels even though those levels have not been randomly sampled, for example, the facets of interviewers and raters. In contrast, a decision maker may consider a facet fixed when (a) the universe of generalization is rather limited or finite, so that all the levels of the facet are included in the studied measurement; or (b) the studied levels involved in the measurement are deliberately selected for theoretical reasons, and the researcher is not interested in generalizing to any other levels.
Suppose that a school district collaborates with a counseling psychologist to introduce and examine the reliability of a self-efficacy measure that involves judgments made by raters about students' responses. The district uses three well-trained specialists as raters. In this example, rater is likely treated as a fixed facet because the district does not plan to expand the pool of raters beyond the set of three individuals who have been trained by the psychologist. However, if the district anticipated recruiting more staff members as raters in the future, then rater would have to be treated as a random facet.

A third characteristic of designs deals with hidden or implicit facets. Hidden facets stem from the fact that (a) only one particular level of a facet is represented in a G study design, with the clearly articulated goal of making the measurement procedure manageable and controlling other facets that might contribute to the measurement error; or (b) the facets are confounded with other facets or with the object of measurement but are not recognized. In both scenarios, the hidden facets are not explicitly planned, recognized, or represented in the designs. As an example of the first scenario, when persons respond to the impulsivity test, the items are administered in English, in a paper-and-pencil mode, presented in a particular layout and color of paper, and on a particular occasion. The facets of language, mode, layout, paper color, and occasion thus are hidden because only one level of each is represented in the design, although they are not explicitly specified. It is impossible to estimate the amount of measurement error attributable to hidden facets (Cronbach, Linn, Brennan, & Haertel, 1997; Shavelson, Ruiz-Primo, & Wiley, 1999). The estimated coefficients from such data do not take into account the possibility that persons' scores might vary on other levels of any of those hidden facets and, consequently, overestimate generalizability when researchers want to generalize over languages, modes, layouts, paper colors, or occasions.

In the second scenario, in which a hidden facet is confounded with another source of variance (e.g., the object of measurement), the levels of the hidden facet vary as the levels of the other source of variance vary. For example, the facet of item order is often confounded with the item facet in a typical p × i design, as items appear in different orders (or the respondents may choose to complete the measurement in different orders). As another example, when interview questions elicit persons' personal accounts describing their perceived stress, the selected account or anecdote becomes a hidden facet. It is confounded with the object of measurement (person) because individuals may recall very different anecdotes, which makes it impossible to disentangle the main effect of person from the anecdote effect. Again, the estimated coefficients from the data of this design could lead to an overestimation of generalizability when a decision maker wants to generalize over anecdotes, because the design fails to consider the possibility that persons' stress scores might vary on other levels of the anecdote-selection facet.

Differences in designs lead to variation in the procedures used to decompose and estimate the variance components, and subsequently to calculate the measurement error and reliability coefficients. The next two sections focus on describing designs that consist of one or multiple facets that are crossed or nested.

One-Facet Design

G theory operationalizes a measurement as a procedure that results in a sample from a universe of admissible observations. Studies of one-facet design involve only one facet, and the universe is defined by a single random facet. If a psychologist plans to generalize from clients' responses on one occasion to a much larger set of occasions, occasion is a facet. The universe would be defined by all admissible occasions (e.g., every morning). If a manager wants to generalize from employees' performance rated by two raters to a much larger set of raters, rater is a facet. The universe would be defined by all admissible raters (e.g., all the raters who have the expertise to rate the employees).
Using the impulsivity test as an example, decision makers are typically indifferent to the particular test score on one set of items. Instead, they would be quite willing to substitute another set of items measuring the construct they are interested in. In other words, the decision makers are more interested in an examinee's general responses to any items from the universe of generalization than in that examinee's score on any particular set of items.

A one-facet p × i design has four sources of variability. The first source is due to the systematic differences among persons, the object of measurement. It reflects individuals' personality, attitudes, behaviors, performance, knowledge, skills, and so on. The second source of variability is due to differences among test items. Some items are easy, some difficult, and some in between for all the individuals. The third source of variability is due to the interaction between individuals and items. For example, an item testing addition skills in the context of money would be easier for a student who has received an allowance from parents than for one who has not. The fourth source of variability, which is usually confounded with the third source, may be due to randomness (e.g., lucky guesses or careless mistakes), to other systematic but unidentified or unknown sources of variability (e.g., individuals taking the test on different occasions or in different classroom settings), or to both.

Once the magnitude of each source of score variation is estimated, the measurements already made in the G study can be evaluated. In addition, D studies can be conducted to redesign measurements in order to reach a desirable reliability (e.g., lengthening the test or hiring more raters). For a one-faceted universe, two designs could be used in G and D studies to estimate variance components: the one-facet crossed design and the one-facet nested design.

One-Facet Crossed Design

In the one-facet crossed design, each person is given the same random sample of items. The design is denoted p × i, where p refers to person (or examinee), i refers to item, and × means "crossed with." The statistical model used to estimate variance components in the one-facet crossed design is described next.

Statistical model. The observed score for any person on any item is denoted Xpi. The universe score, denoted μp, is defined as a person's average score over the entire item universe, Ei(Xpi), where the symbol E is an expectation operator and the subscript i designates that the expectation is taken over items. Similarly, μi is the population mean for item i, defined as Ep(Xpi), the expected value of Xpi over persons. μ is the grand mean over both the population and the universe, EpEi(Xpi).

An observed score for one person on one item (Xpi) can be decomposed as follows:

    Xpi = μ  [grand mean]
        + (μp − μ)  [person effect]
        + (μi − μ)  [item effect]
        + (Xpi − μp − μi + μ)  [person × item interaction confounded with residual]   (1)

Each effect other than the grand mean in Equation 1 has a distribution with a mean of zero and its own variance σ², called the variance component. For example, the variance component for the person effect is defined as σ²p = Ep(μp − μ)²; that is, it is the average of the squared deviations of the persons' universe scores from the grand mean. The variance component for items, σ²i, can be defined similarly. The final effect, the p × i interaction confounded with residual, also has a mean of zero and a variance denoted σ²pi,e. This effect includes the p × i interaction effect, which reflects the fact that not all people find the same items easy or difficult, confounded with unsystematic variability and with unspecified variability from hidden facets not explicitly included or controlled in the one-facet G study; with only one observation per cell in the person × item data matrix, the interaction cannot be disentangled from "within-cell" variability.
In sum, the variance of the collection of observed scores Xpi over all persons and items in the universe is expressed as the sum of the three variance components:

    σ²(Xpi) = σ²p + σ²i + σ²pi,e   (2)

That is, the variance of item scores can be partitioned into independent sources of variation due to differences between persons as the object of measurement, σ²p, and two sources of measurement error: variation due to items, σ²i, and variation due to the p × i interaction confounded with the residual, σ²pi,e.

In the one-facet crossed design, the generalizability coefficient (ρ²), calculated as σ²p/(σ²p + σ²Rel), pertains to relative decisions. Specifically, the variance component σ²pi,e contributes to relative measurement error because the interaction between person and item influences the relative standing of individuals. In contrast, the variance component due to the item effect, σ²i, does not contribute to relative measurement error because variance in item difficulty, which influences everybody, does not influence the ranking of individuals. The dependability coefficient (Φ) is calculated as σ²p/(σ²p + σ²Abs) for absolute decisions. Here, all variance components except the object of measurement contribute to the absolute measurement error, that is, σ²i and σ²pi,e in this one-facet crossed design. Usually, the generalizability coefficient is larger than or equal to the dependability coefficient.

Numerical example for the G study. Let's use a specific example to illustrate how to estimate variance components for the one-facet crossed design. Suppose that eight items were given to 10 individuals, and the data are displayed in Table 1. Items are regarded as a random sample from the infinite universe of all items, and persons are randomly sampled from the population of individuals.

Table 1  Item scores on eight items for 10 examinees (modified from Brennan, 2001, p. 28).

                              Item
    Person       1    2    3    4    5    6    7    8    Person mean
    1            1    0    1    0    0    0    0    0    0.250
    2            1    1    1    0    0    1    0    0    0.500
    3            1    1    1    1    1    0    0    0    0.625
    4            1    1    0    1    1    0    0    1    0.625
    5            1    1    1    1    1    0    1    0    0.750
    6            1    1    1    0    1    1    1    0    0.750
    7            1    1    1    1    1    1    1    0    0.875
    8            1    1    1    1    0    1    1    1    0.875
    9            1    1    1    1    1    1    1    1    1.000
    10           1    1    1    1    1    1    1    1    1.000
    Item mean    1.0  0.9  0.9  0.7  0.7  0.6  0.6  0.4  0.725

Analysis of variance (ANOVA) can be used to estimate the variance components. Several statistical packages are available for conducting G and/or D studies based on the ANOVA approach: (a) GENOVA and urGENOVA (along with mGENOVA), authored by Brennan as a family of programs; (b) the Variance Components procedure in SPSS; and (c) Proc Varcomp in SAS. The GENOVA suite of computer programs can run G and D studies for a variety of designs in which facets have different properties and missing values may be present. In contrast, the SPSS and SAS programs can only produce results for G studies; users then have to use the variance components estimated by the two programs to calculate the parameters for D studies.
Two-way ANOVA, with score as the dependent variable and person and item as independent variables, is conducted to calculate the mean squares (MSs) for persons, items, and the residual. The MSs can in turn be expressed as expected mean squares (EMSs), weighted sums of the variance components, as follows (for derivations of the EMS equations, see Kirk, 1982):

    EMSp    = σ²pi,e + ni·σ²p
    EMSi    = σ²pi,e + np·σ²i
    EMSpi,e = σ²pi,e

These equations can be used to estimate each variance component after each EMS is replaced with the corresponding observed MS and each exact σ² is replaced with its estimate σ̂². Solving the equations yields:

    σ̂²pi,e = MSpi,e
    σ̂²p    = (MSp − σ̂²pi,e)/ni = (MSp − MSpi,e)/ni
    σ̂²i    = (MSi − σ̂²pi,e)/np = (MSi − MSpi,e)/np

Table 2 presents the ANOVA results and the variance components estimated by using the formulas in this subsection. Based on the estimated variance components, ρ² for relative decisions is calculated as σ²p/(σ²p + σ²pi,e) = 0.0365/(0.0365 + 0.1470) = 0.199. Φ for absolute decisions is calculated as σ²p/(σ²p + σ²i + σ²pi,e) = 0.0365/(0.0365 + 0.0246 + 0.1470) = 0.175.

Table 2  ANOVA estimates of variance components for the one-facet crossed design data in Table 1.

    Source of variation                Sum of squares   df   Mean square   Variance component calculation   Estimated variance component
    Person (p)                         3.950            9    0.439         (MSp − MSpi,e)/ni                0.0365
    Items (i)                          2.750            7    0.393         (MSi − MSpi,e)/np                0.0246
    Interaction with residual (pi,e)   9.250            63   0.147         MSpi,e                           0.1470

It is important to notice that the variance components estimated in Table 2 reflect the variance of a one-item test; therefore, the generalizability coefficients calculated are associated with scores from a one-item test. It is not surprising that the coefficients are rather low. If multiple items (n′i) are given to the examinees, the magnitudes of the variance components and coefficients will improve. D studies can be conducted to estimate how the magnitudes of the variance components and the corresponding generalizability coefficients vary with the number of items.

From basic sampling theory in statistics, it is well known that the variance of a mean score is the variance of the individual scores divided by the number of scores from which the mean is calculated. Therefore, the variance component associated with n′i items, when the score interpretation is based on the mean of scores from n′i items, can be estimated as follows:

    σ²i′    = σ²i/n′i
    σ²pi,e′ = σ²pi,e/n′i

In the example discussed earlier, if 10 items rather than one item are used, both σ²i and σ²pi,e become one tenth of the original values; that is, σ²i′ = 0.00246 and σ²pi,e′ = 0.0147. Correspondingly,

    ρ² = σ²p/(σ²p + σ²pi,e/n′i) = 0.0365/(0.0365 + 0.147/10) = 0.713

    Φ = σ²p/(σ²p + σ²i/n′i + σ²pi,e/n′i) = 0.0365/(0.0365 + 0.0246/10 + 0.147/10) = 0.680
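The calculations above are easy to reproduce with general-purpose software. The following short Python sketch (an illustration added for this discussion, not part of the original entry; it assumes only that NumPy is available) estimates the Table 2 variance components from the Table 1 data and computes the one-item coefficients:

    import numpy as np

    # Table 1: scores of 10 persons (rows) on 8 items (columns).
    X = np.array([[1, 0, 1, 0, 0, 0, 0, 0],
                  [1, 1, 1, 0, 0, 1, 0, 0],
                  [1, 1, 1, 1, 1, 0, 0, 0],
                  [1, 1, 0, 1, 1, 0, 0, 1],
                  [1, 1, 1, 1, 1, 0, 1, 0],
                  [1, 1, 1, 0, 1, 1, 1, 0],
                  [1, 1, 1, 1, 1, 1, 1, 0],
                  [1, 1, 1, 1, 0, 1, 1, 1],
                  [1, 1, 1, 1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 1, 1, 1, 1]], dtype=float)
    n_persons, n_items = X.shape
    grand = X.mean()

    # Two-way ANOVA sums of squares (one observation per person-item cell).
    ss_p = n_items * ((X.mean(axis=1) - grand) ** 2).sum()       # persons: 3.950
    ss_i = n_persons * ((X.mean(axis=0) - grand) ** 2).sum()     # items: 2.750
    ss_pi_e = ((X - grand) ** 2).sum() - ss_p - ss_i             # interaction + residual: 9.250

    ms_p = ss_p / (n_persons - 1)                                # 0.439
    ms_i = ss_i / (n_items - 1)                                  # 0.393
    ms_pi_e = ss_pi_e / ((n_persons - 1) * (n_items - 1))        # 0.147

    # Variance components via the EMS equations.
    var_pi_e = ms_pi_e                                           # 0.1470
    var_p = (ms_p - ms_pi_e) / n_items                           # 0.0365
    var_i = (ms_i - ms_pi_e) / n_persons                         # 0.0246

    # One-item coefficients, as reported in the text.
    rho2 = var_p / (var_p + var_pi_e)                            # 0.199
    phi = var_p / (var_p + var_i + var_pi_e)                     # 0.175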
Compared with the coefficients associated with one item, the error variances are reduced and the coefficients are higher. This is analogous to the Spearman–Brown formula in classical theory. That is, the decision maker can reduce error variance by increasing the number of items and then using the average (or sum) over the items as the person's score. Similarly, by setting the desired generalizability or dependability coefficient, the decision maker can estimate and determine the value of n′i (the needed length of the test) to maintain the technical requirements of the measurement. For example, if a teacher wants ρ² to reach 0.80 in the example discussed, she can solve the following equation to estimate the number of items required: ρ² = σ²p/(σ²p + σ²pi,e/n′i) = 0.0365/(0.0365 + 0.147/n′i) = 0.80. Therefore, n′i = 16.1, so at least 17 items are needed for ρ² to be 0.80.
to be 0.80.
One-Facet Nested Design

In a one-facet nested design, each person is given a different set of random items. For instance, in the example of item as the nested facet with 10 students, the eight items given to each person are all different. That is, items 1 to 8 are given to person 1, items 9 to 16 are given to person 2, and so on, until items 73 to 80 are given to person 10. The design is denoted i : p.

Statistical model. In the nested design, because each person is given a different set or group of items, effects attributable solely to items are indistinguishable from interaction and other random effects. Suppose that the data in Table 1 are collected from a nested design; the column means in Table 1 then do not exist, because persons' scores in each column are on different items, and it is meaningless to take an average for each item across persons. Therefore, the item effect does not exist in the linear model of the nested design. When the one-facet design is nested, an observed score for one person on one item (Xpi) can be decomposed as follows:

    Xpi = μ  [grand mean]
        + (μp − μ)  [person effect]
        + (Xpi − μp)  [nested effect confounded with residual]   (3)

Correspondingly, the variance of the collection of observed scores Xpi over all persons and items in the universe is expressed as the sum of the two variance components:

    σ²(Xpi) = σ²p + σ²i:p,e   (4)

The ANOVA procedure can then be used to calculate the MSs and then estimate the variance components.

Numerical example. To use space efficiently, this section continues to use the data in Table 1, but now imagine that different persons take different samples of items. That is, let us assume that Items 1 to 8 are different sets for different persons. Table 3 presents the ANOVA results and the estimated variance components.

Table 3  ANOVA estimates of variance for the data in Table 1 (assuming that the data are from a one-facet nested design instead of a one-facet crossed design).

    Source of variation   Sum of squares   df   Mean square   Variance component calculation   Estimated variance component
    Person (p)            3.950            9    0.439         (MSp − MSresidual)/ni            0.0335
    i : p, e              12.000           70   0.171         MSresidual                       0.1710
Once the variance components are estimated, the ρ² and Φ coefficients can be calculated. Unlike in a one-facet crossed design, the errors for relative and absolute decisions are the same, because the error variance associated with the item main effect cannot be distinguished from the other sources of error. Accordingly, ρ² = Φ = 0.0335/(0.0335 + 0.1710) = 0.164.

D studies. Similar to D studies for a one-facet crossed design, when the number of items increases, the error variance decreases and the coefficients ρ² and Φ increase. D studies allow decision makers to estimate how the magnitude of the coefficients changes with the number of items and to determine the number of items required to reach certain coefficients.

Because the variance of the person effect (σ²p) is the universe-score variance, which does not change, the D study only needs to calculate the variance component for the nested effect confounded with the residual when using n′i items. The formula is σ²i:p,e′ = σ²i:p,e/n′i.

Suppose that 10 items are given to examinees. The coefficients for relative and absolute decisions can be calculated as ρ² = Φ = 0.0335/(0.0335 + 0.1710/10) = 0.662. Similar to the solution in the one-facet crossed design, to reach a coefficient of 0.80, the following equation needs to be solved: ρ² = Φ = 0.0335/(0.0335 + 0.1710/n′i) = 0.80. n′i is found to be 20.4; that is, at least 21 items are needed to reach coefficients of 0.80.
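The same projection is even simpler for the nested design, because there is a single error term (another added Python illustration, using the Table 3 estimates):

    def nested_coefficient(var_p, var_ip_e, n_items):
        """One-facet nested design: relative and absolute error coincide."""
        return var_p / (var_p + var_ip_e / n_items)

    nested_coefficient(0.0335, 0.1710, 10)   # 0.662; this is both rho^2 and phi
    n_needed = next(n for n in range(1, 200)
                    if nested_coefficient(0.0335, 0.1710, n) >= 0.80)   # 21 items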
examples. Depending on the structure of facets
Multifacet Designs

This section uses designs with two facets as multifacet examples to describe the crossed and nested models used in G and D studies. Because the difference between crossed and nested designs and the statistical models have already been described in a previous section of this entry, this section focuses on the interpretation of variance components involving multiple facets, without going into the technical details of each design, due to space constraints. Instead, a G study example of a crossed design is introduced, and then both crossed and nested designs used in D studies are described. In the descriptions, the basic concepts of G theory are reviewed to articulate and stress the essence of applying G theory to measurement issues.

A multifacet universe is defined by more than one facet, each of which may be crossed or nested with the object of measurement (person) and with the other facet(s), and may be random or fixed. If a psychologist plans to generalize from clients' responses to a set of questionnaire items on one occasion to a larger set of items on a much larger set of occasions, item and occasion are the two facets involved. The universe would be defined by the infinite number of items on all admissible occasions. If a researcher wants to generalize from toddlers' interaction with their peers rated by two raters on eight indicators to a much larger set of raters on a much larger set of indicators, then both rater and indicator are facets that should be considered. The universe would be defined by all admissible raters (e.g., all the raters who have the expertise to rate the children) on all admissible indicators of the construct of interest. As described in these examples, studies that involve two or more facets are called multifacet designs.

Using the item and rater facets as a concrete example, there is potentially a variety of designs that the psychologist can consider for her study. Table 4 lists some examples. Depending on the structure of the facets (e.g., how facets are crossed or nested), the variability of observed scores is partitioned into different main effects and interaction effects, in addition to the person effect as the object of measurement (in most cases). The mathematical calculations for these effects differ as well, but this is not the focus of this entry (for additional explanation, see Brennan, 2001).
Table 4  Examples of two-facet designs.

                  Variability due to object of
                  measurement (person effect)   Variability due to measurement error
    Design                                      Main effects   Interaction effects
    p × i × r     p                             i, r           pi, pr, ir, pir,e
    p × (i : r)   p                             r, i:r         pr, pi:r,e
    (i : p) × r   p                             r, i:p         pr, ir:p,e
    i : (p × r)   p                             r              pr, i:pr,e
    (i × r) : p   p                             i:p, r:p       ir:p,e

Among all the possible designs that a researcher can select, the fully crossed design (p × i × r) is the most advantageous for the G study, because other nested designs, or a crossed design with a different sampling of items and raters, can then be examined in D studies based on the variance components estimated in the G study. The utility of nested designs in subsequent D studies, by contrast, can be rather limited. For example, the design i : (p × r) or (i × r) : p can be considered in D studies when the G study employs the p × i × r design, but not vice versa, because an effect confounded with other effects in the G study cannot be decomposed subsequently in D studies. Therefore, if resources allow, researchers should consider using the fully crossed design in G studies, as this design offers more flexibility for the designs in D studies. Given the constraints imposed by resources, researchers then need to specify the needed resources in terms of items to be constructed, raters to be trained, and other logistics (e.g., recruiting raters or item developers). Assume that the measurement procedure to be studied implies 10 items and two raters. The design p × (i : r) then indicates that five items will be scored by Rater 1 and another five items by Rater 2, whereas the design (i : p) × r indicates that each person will take a measurement with a different set of 10 items, but all the responses will be scored by both raters. The selection of the study design is therefore largely determined by the intended generalizations in the D studies, as well as by the resources available for data collection.

Similar to the one-facet designs, the generalizability and dependability coefficients can be calculated using the relative measurement error and the absolute measurement error, respectively. The D studies involve a decision-making process that considers the pattern of variance components associated with multiple facets, evaluates the options available by manipulating the properties of the facets, and decides on the optimal measurement procedure, one that balances the technical requirements and the logistical costs. In what follows, two designs, p × i × r for the multifacet crossed design and p × (i : r) as an example of a multifacet nested design, are described to illustrate the reasoning for partitioning the score variability, estimating the generalizability and dependability coefficients, and conducting D studies.

A G Study Example Using the Multifacet Crossed Design

In the multifacet crossed design, each person is given the same random sample of items, scored by the same sample of raters. The design is denoted p × i × r, where p refers to person (or examinee) as the object of measurement, and i and r refer to the item and rater facets, respectively. The statistical model used to estimate variance components in this crossed design is described next.

Statistical model. The observed score for any person on any item given by any rater is denoted Xpir. This observed score in the two-facet crossed design can be decomposed as follows:
    Xpir = μ  [grand mean]
         + (μp − μ)  [person effect]
         + (μi − μ)  [item effect]
         + (μr − μ)  [rater effect]
         + (μpi − μp − μi + μ)  [person × item effect]
         + (μpr − μp − μr + μ)  [person × rater effect]
         + (μir − μi − μr + μ)  [item × rater effect]
         + (Xpir − μpi − μpr − μir + μp + μi + μr − μ)  [person × item × rater interaction confounded with residual]   (5)

Given the main effects and interaction effects in Equation 5, the variance of observed scores can be decomposed as the sum of seven variance components, including the person variance as the object of measurement and the measurement errors accounted for by different sources:

    σ²(Xpir) = σ²p + σ²i + σ²r + σ²pi + σ²pr + σ²ir + σ²pir,e   (6)

That is, the variance of item scores can be partitioned into independent sources of variation due to differences between persons as the object of measurement, σ²p, and six sources of measurement error: variation due to the two main effects, σ²i and σ²r; variation due to the two-way interactions, σ²pi, σ²pr, and σ²ir; and σ²pir,e, the p × i × r three-way interaction confounded with the unsystematic or unspecified residual.

For relative decisions, all variance components that influence the relative standing of individuals contribute to relative measurement error; these include σ²pi, σ²pr, and σ²pir,e. For absolute decisions, all variance components except the object of measurement contribute to the absolute measurement error, that is, all the variance components except the person variance in this p × i × r design. A generalizability coefficient and a dependability coefficient are then calculated based on these two types of measurement error.

Numerical example for the G study. Let's use a specific example to illustrate how to estimate variance components for the multifacet crossed design. Suppose that 10 open-ended items measuring interpersonal skills were given to 50 individuals, and the responses were scored by three raters using a 1–4 scale. Due to space limitations, the full data set is not included; Table 5 displays the data layout. Because all the individuals are measured at all the levels of the facets (i.e., 10 items and three raters), item and rater are crossed facets. The observations are regarded as randomly sampled from the universe of generalization, as scores on all the admissible items rated by admissible raters.

Based on the MS results of the ANOVA, each variance component can be estimated (Table 6; estimating variance components with the ANOVA method is described in the previous section; for additional explanation, see Kirk, 1982). The relative magnitudes of the estimated variance components, except for the person variance, provide information about potential sources of measurement error that influence this psychological measure of interpersonal skills. The person variance due to differences in universe scores (σ²p = 0.1884) is relatively large compared to the other components, accounting for 28% of the total variation even when only one item and one rater are sampled from the universe. The other larger components are two interaction effects, σ²pir,e and σ²pi, accounting for 38% and 32% of the score variation, respectively. Specifically, σ²pir,e, the largest variance component, reflects that some individuals' scores can be relatively low or high depending on the particular items graded by particular raters; in other words, persons' scores for interpersonal skills vary across items and raters. The large σ²pi suggests that the relative standings of individuals' performance vary across items.
Table 5  Item scores for the crossed Person × Item × Rater G study.

                 Item 1          Item 2          Item 3              Item 10
    Person       R1   R2   R3    R1   R2   R3    R1   R2   R3    …   R1   R2   R3
    1            4    4    4     1    1    2     2    2    3     …   3    4    4
    2            2    3    3     2    2    2     1    2    1     …   2    2    2
    3            4    4    4     1    1    1     1    2    2     …   3    3    4
    …
    N = 50       3    4    3     1    2    1     2    2    1     …   4    4    4

    Note: R1–R3 denote Raters 1–3.
Table 6  Estimated variance components in the p × i × r example.

    Source of variability   Symbol of variance component   Estimated variance component   Total variability (%)
    Person (p)              σ²p                            0.1884                         28
    Item (i)                σ²i                            0.0111                          2
    Rater (r)               σ²r                            0.0000                          0
    p × i                   σ²pi                           0.2132                         32
    p × r                   σ²pr                           0.0028                          0
    i × r                   σ²ir                           0.0014                          0
    p × i × r, e            σ²pir,e                        0.2589                         38

The σ²i accounts for only 2% of the score variance, a rather negligible source of measurement error. This shows that, although some items tend to elicit slightly higher scores from all individuals than other items do, this unevenness in item difficulty contributes little to the overall score variation once the other sources of variation are taken into consideration. Finally, the other components (i.e., σ²r, σ²pr, and σ²ir), all of which are associated with the rater facet, appear not to contribute much to the score variability. This provides evidence that the training of the raters and the use of scoring systems minimized the potential measurement error from the rater main effect and the related interaction effects.

After pinpointing the relative and absolute measurement errors, the generalizability coefficient (ρ²) and dependability coefficient (Φ) can be calculated for the case in which only one item and one rater are sampled from the universe of admissible observations, as follows:

    ρ² = σ²p/(σ²p + σ²Rel)
       = σ²p/(σ²p + σ²pi + σ²pr + σ²pir,e)
       = 0.1884/(0.1884 + 0.2132 + 0.0028 + 0.2589) = 0.284

    Φ = σ²p/(σ²p + σ²Abs)
      = σ²p/(σ²p + σ²i + σ²r + σ²pi + σ²pr + σ²ir + σ²pir,e)
      = 0.1884/(0.1884 + 0.0111 + 0 + 0.2132 + 0.0028 + 0.0014 + 0.2589) = 0.279

Examples of D Studies Using the Crossed and Nested Designs

Based on the estimated variance components in Table 6, more items and raters are apparently needed in order to increase the generalizability or dependability of score inferences. Assume that the D studies still use the crossed design with both facets random. Table 7 reports the estimated variance components, measurement errors, and generalizability and dependability coefficients when the number of items is 5, 10, or 15 and the number of raters is 2, 3, or 4. As more items and more raters are included in a measurement procedure, smaller measurement error and higher coefficients result.

Alternatively, the results of D studies can be presented in figures. For instance, Figure 1 shows the dependability coefficients for given numbers of items (n′i ranging from 2 to 14) and raters (n′r ranging from 2 to 4). Table 7 and Figure 1 suggest that the benefit of adding items in order to reduce the error variance starts to diminish at approximately nine items, and that the benefit of adding raters reaches a plateau at three raters.

The results of the D studies also make it evident that the development, specification, and evaluation of new measurement procedures can be handled systematically with the G theory approach. For example, in order to maintain a dependability coefficient of 0.80, a measurement procedure using eight items and two raters, or seven items and three raters, can be considered.
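The coefficients above, and the D-study projections reported in Table 7, follow from a handful of divisions. A short Python sketch (an added illustration, not part of the original entry; the function and variable names are ours) makes the bookkeeping explicit:

    # Variance components from the Table 6 G study.
    vc = {'p': 0.1884, 'i': 0.0111, 'r': 0.0000, 'pi': 0.2132,
          'pr': 0.0028, 'ir': 0.0014, 'pir_e': 0.2589}

    def d_study_crossed(vc, ni, nr):
        """rho^2 and phi for a p x i x r D study with ni items and nr raters."""
        rel = vc['pi'] / ni + vc['pr'] / nr + vc['pir_e'] / (ni * nr)
        ab = rel + vc['i'] / ni + vc['r'] / nr + vc['ir'] / (ni * nr)
        return vc['p'] / (vc['p'] + rel), vc['p'] / (vc['p'] + ab)

    d_study_crossed(vc, 1, 1)    # (0.284, 0.279): the G-study coefficients
    d_study_crossed(vc, 10, 3)   # about (0.859, 0.855): one cell of Table 7, up to rounding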
Table 7  Estimated variance components (EVCs) for the p × i × r random effects design based on the results of the G study in Table 6.

                     EVC in                              n′r = 2                 n′r = 3                 n′r = 4
    Source of        G         EVC in D studies          n′i=5   n′i=10  n′i=15  n′i=5   n′i=10  n′i=15  n′i=5   n′i=10  n′i=15
    variability      study

    Person (p)       0.1884    σ²p                       0.188   0.188   0.188   0.188   0.188   0.188   0.188   0.188   0.188
    Item (i)         0.0111    σ²i/n′i                   0.002   0.001   0.001   0.002   0.001   0.001   0.002   0.001   0.001
    Rater (r)        0.0000    σ²r/n′r                   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
    p × i            0.2132    σ²pi/n′i                  0.043   0.021   0.014   0.043   0.021   0.014   0.043   0.021   0.014
    p × r            0.0028    σ²pr/n′r                  0.001   0.001   0.001   0.001   0.001   0.001   0.001   0.001   0.001
    i × r            0.0014    σ²ir/(n′i·n′r)            0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
    p × i × r, e     0.2589    σ²pir,e/(n′i·n′r)         0.026   0.013   0.009   0.017   0.009   0.006   0.013   0.006   0.004
    Relative measurement error (σ²Rel)                   0.070   0.035   0.024   0.061   0.031   0.021   0.057   0.028   0.019
    Absolute measurement error (σ²Abs)                   0.072   0.036   0.025   0.063   0.032   0.022   0.059   0.030   0.020
    Generalizability coefficient (ρ²)                    0.729   0.843   0.887   0.755   0.858   0.900   0.767   0.870   0.908
    Dependability coefficient (Φ)                        0.723   0.839   0.883   0.749   0.855   0.895   0.761   0.866   0.904
Depending on its financial and human resources, a decision maker can compare these measurement procedures and select the more efficient one that meets adequate psychometric requirements.

[Figure 1 is a line graph plotting the dependability coefficient (Φ, vertical axis, approximately 0.55 to 0.90) against the number of items (horizontal axis, 2 to 14), with one curve for each of n′r = 2, 3, and 4 raters.]

Figure 1  Dependability coefficients for given numbers of items and raters.

As explained earlier, a decision maker may choose a nested design when planning possible measurement procedures in the D studies, given logistic considerations. In this case, each person does not have scores from all the levels of the facets. There are many variations of nested designs, depending on the relationship between the object of measurement and the facets, for example, p × (i : r) or i : r : p. Each requires a different procedure for compiling test items and arranging the scoring. In what follows, these two nested designs are used to illustrate the conceptual understanding needed for conducting D studies and interpreting variance components.

Again, 10 items and three raters are the measurement conditions sampled from the universe. In the p × (i : r) design, each individual would be administered the same set of 30 items, among which 10 items are scored by Rater 1, another 10 items are scored by Rater 2, and the last set of 10 items by Rater 3. Of the two random facets, item is a nested facet and rater is a crossed facet. In the i : r : p design, both facets are nested. This design requires each individual to be tested with a different set of 30 items scored by a different group of three raters, with each rater grading only 10 items. Based on the results of the crossed G study from Table 6, Table 8 presents the estimated variance components for a series of D studies using the two nested designs. Similar to the D studies for one-facet designs, the error variance is reduced and the coefficients ρ² and Φ are increased when more levels of the item and rater facets are sampled from the universe.
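Because the nested D-study designs confound some of the crossed-design effects, their variance components are simply sums of the Table 6 components. A sketch (an added illustration, reusing the conventions of the d_study_crossed example above) shows where the Table 8 entries come from:

    # p x (i : r): i and ir confound into i:r; pi and pir,e confound into pi:r,e.
    vc_nested = {'p': 0.1884,
                 'r': 0.0000,
                 'i_r': 0.0111 + 0.0014,        # 0.012 in Table 8
                 'pr': 0.0028,
                 'pi_r_e': 0.2132 + 0.2589}     # 0.472 in Table 8

    def d_study_p_x_i_in_r(vc, ni, nr):
        """rho^2 and phi for p x (i : r) with ni items per rater and nr raters."""
        rel = vc['pr'] / nr + vc['pi_r_e'] / (ni * nr)
        ab = rel + vc['r'] / nr + vc['i_r'] / (ni * nr)
        return vc['p'] / (vc['p'] + rel), vc['p'] / (vc['p'] + ab)

    d_study_p_x_i_in_r(vc_nested, 10, 3)   # about (0.92, 0.92); Table 8 reports 0.917

    # i : r : p confounds further: r:p = r + pr, and i:r:p,e = i + ir + pi + pir,e.
    var_r_p = 0.0000 + 0.0028                        # 0.003 in Table 8
    var_irp_e = 0.0111 + 0.0014 + 0.2132 + 0.2589    # 0.484 in Table 8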
Table 8  Estimated variance components (EVCs) for the p × (i : r) and i : r : p random effects designs based on the results of the p × i × r G study in Table 6.

                     EVC using                           n′r = 2                 n′r = 3                 n′r = 4
    Source of        1 rater,  EVC in D studies          n′i=5   n′i=10  n′i=15  n′i=5   n′i=10  n′i=15  n′i=5   n′i=10  n′i=15
    variability      1 item

    p × (i : r) design
    Person (p)       0.188     σ²p                       0.188   0.188   0.188   0.188   0.188   0.188   0.188   0.188   0.188
    Rater (r)        0.000     σ²r/n′r                   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
    Item (i : r)     0.012     σ²i:r/(n′i·n′r)           0.001   0.001   0.000   0.001   0.000   0.000   0.001   0.000   0.000
    p × r            0.003     σ²pr/n′r                  0.002   0.002   0.002   0.001   0.001   0.001   0.001   0.001   0.001
    pi : r, e        0.472     σ²pi:r,e/(n′i·n′r)        0.047   0.024   0.016   0.031   0.016   0.010   0.024   0.012   0.008
    Generalizability coefficient (ρ²)                    0.793   0.879   0.913   0.855   0.917   0.945   0.883   0.935   0.954
    Dependability coefficient (Φ)                        0.790   0.874   0.913   0.851   0.917   0.945   0.879   0.935   0.954

    i : r : p design
    Person (p)       0.188     σ²p                       0.188   0.188   0.188   0.188   0.188   0.188   0.188   0.188   0.188
    r : p            0.003     σ²r:p/n′r                 0.002   0.002   0.002   0.001   0.001   0.001   0.001   0.001   0.001
    i : r : p, e     0.484     σ²i:r:p,e/(n′i·n′r)       0.048   0.024   0.016   0.032   0.016   0.011   0.024   0.012   0.008
    Generalizability coefficient (ρ²)                    0.790   0.879   0.913   0.851   0.917   0.940   0.883   0.935   0.954
    Dependability coefficient (Φ)                        0.790   0.879   0.913   0.851   0.917   0.940   0.883   0.935   0.954
Depending on the magnitudes of the variance components reported in the G studies, the patterns of estimated variance components for different designs can differ dramatically or can be similar, as shown in the two examples in Table 8. Considering that facets can be fixed or random in D studies, decision makers can propose even more variations of complicated designs to evaluate and make decisions about future measurement procedures (for details on different designs and the estimation of the variance components, see Brennan, 2001).

Additional Concepts and Issues

The previous three sections are intended to provide a brief review of the basic concepts of G theory and its application in tackling measurement issues. This section focuses on three topics that are relevant to the use of G theory in investigating psychological measurements. Discussion of more advanced topics and concepts, such as multivariate G theory and the estimation of variance components for fixed facets, can be found in Brennan (2001) and Shavelson and Webb (1991).

Use of G Theory in Validity Studies

Although G theory was originally introduced by Cronbach and his colleagues (Cronbach et al., 1972) as an extension of reliability theory and has been frequently implemented to address reliability issues by researchers in the field of psychological and educational measurement, it can also be applied to validity studies (Brennan, 2001; Kane, 1982). G theory conceptualizes the variance of observed scores as a collection of variability due to multiple effects, including the main effect due to the object of measurement, which reflects the systematic differences among individuals in the construct of interest. The main effects of facets are specified, as well as their interaction effects. Facets studied in a G or D study can be more pertinent to drawing appropriate inferences or interpretations from scores than to the precision (reliability and consistency) of score interpretations. As Kane (1982) points out, a construct can be operationally defined as a dispositional attribute in terms of universe scores, with facets as measurement conditions that should not produce any construct-irrelevant variance. He argues that this invariance law allows the multifacet sampling approach to offer a straightforward interpretation of construct validity.

For example, with respect to measuring learning disability, the facet of task domain or measurement method (such as drawing a design, conceptual grouping, or numerical memory) can be of interest to researchers. A Person × (Item : Domain) nested design can then be used to examine whether the test scores from the sampled domains measure individuals consistently. An analysis showing that the variance components related to the domain facet, both main effect and interaction effects, do not contribute to the overall score variation can thus be considered evidence for construct validity, analogous to convergent validity. As another example, researchers who are interested in evaluating the accuracy of self-report scores can conceptualize the self-reporting method as one level of a method facet, in comparison with another level, observation by therapists, a method that has been empirically validated. The variance components related to the method facet can then verify whether the self-report scores from patients provide accurate interpretations of the intended construct.

In summary, when a decision maker defines the construct to be measured and specifies the facets and the universe, to some degree she also defines which facets are relevant to the validity claims of score interpretations, which are primarily related to reliability claims, and which may be pertinent to both. This knowledge should then guide the design, analysis, and interpretation of G studies and D studies.

Estimation of Variance Components

The isolation and estimation of variance components in G studies can be carried out using various methods. This entry describes a commonly used one, the ANOVA method. Other methods include likelihood-based methods such as maximum likelihood and restricted maximum likelihood (e.g., Searle, 1987; Searle, Casella, & McCulloch,
Estimation of Variance Components

The isolation and estimation of variance components in G studies can be carried out using various methods. This entry describes a commonly used one, the ANOVA method. Other methods include likelihood-based methods such as maximum likelihood and restricted maximum likelihood (e.g., Searle, 1987; Searle, Casella, & McCulloch, 1992), resampling methods such as the bootstrap and the jackknife (e.g., Brennan, Harris, & Hanson, 1987; Wiley, 2000), and Bayesian analysis (e.g., Brennan, 2001). The latter two groups of methods are generally less sensitive to assumptions about the score distribution, especially for dichotomous scores.
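For the simplest crossed design, persons × items (p × i), the ANOVA method reduces to solving the expected-mean-square equations in closed form. The following is a minimal sketch of that computation (it assumes the crossed p × i design, not the nested design of the table above; the function name and simulated data are ours):

```python
import numpy as np

def anova_variance_components(X):
    """Estimate G-study variance components for a persons x items
    (p x i) design by the ANOVA (expected mean squares) method.
    X: 2-D array of scores, rows = persons, columns = items."""
    n_p, n_i = X.shape
    grand = X.mean()
    person_means = X.mean(axis=1)
    item_means = X.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication.
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = X - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the components.
    return {
        "p": (ms_p - ms_res) / n_i,   # universe-score (person) variance
        "i": (ms_i - ms_res) / n_p,   # item main effect
        "pi,e": ms_res,               # interaction confounded with error
    }

# Simulated example: 100 persons responding to 10 items.
rng = np.random.default_rng(0)
truth = {"p": 0.5, "i": 0.1, "pi,e": 0.4}
scores = (rng.normal(0, truth["p"] ** 0.5, (100, 1))
          + rng.normal(0, truth["i"] ** 0.5, (1, 10))
          + rng.normal(0, truth["pi,e"] ** 0.5, (100, 10)))
print(anova_variance_components(scores))
```

The estimates recovered this way feed directly into D-study calculations of the kind shown in the table above, with each component divided by the intended number of conditions for its facet.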
When conducting G studies with the ANOVA method, researchers may encounter negative values for the estimated variance components, mostly in multifacet designs. Conceptually, a variance cannot be negative; negative values therefore often signal problems due to sampling error or model misspecification (Shavelson & Webb, 1981), that is, that the levels of certain facet(s) are not comparable. Two strategies are often used to address negative estimates when the values are small: (a) Cronbach's algorithm, which substitutes zero for the negative estimate and uses that zero wherever it is needed to calculate other variance components (Cronbach et al., 1972); and (b) Brennan's algorithm, which uses the negative values in the calculation of all the variance components and then sets the negative estimates to zero at the end (Brennan, 2001). Brennan's is the preferred approach. That said, statistical methods other than ANOVA, such as Bayesian analysis and maximum likelihood, can be used to avoid the problem of negative estimates altogether.
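The difference between the two strategies can be sketched as follows. Here `solve_component` is a hypothetical helper standing in for the design-specific expected-mean-square equations (it is not part of any G-theory package), and `effects` lists the components in the order the design requires them to be solved:

```python
def cronbach_truncation(effects, solve_component):
    """Cronbach et al. (1972): set a negative estimate to zero as soon
    as it appears, and use that zero when solving for the remaining
    variance components."""
    estimates = {}
    for effect in effects:
        estimates[effect] = max(0.0, solve_component(effect, estimates))
    return estimates

def brennan_truncation(effects, solve_component):
    """Brennan (2001): carry negative estimates through all of the
    equations, truncating to zero only in the final reported set."""
    estimates = {}
    for effect in effects:
        estimates[effect] = solve_component(effect, estimates)
    return {effect: max(0.0, est) for effect, est in estimates.items()}
```

When no intermediate estimate is negative, the two procedures return identical results; they differ only in whether a negative value is allowed to influence the equations for the remaining components.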
Standard Error Estimates

One critical concept in interpreting the results of G and D studies is the estimated standard error. Like the standard error of measurement in classical reliability theory, standard error estimates based on relative or absolute measurement error can inform researchers about a confidence interval for individuals' universe scores. Results of G and D studies can be interpreted meaningfully by comparing this interval range with the score scale. Furthermore, the standard errors of the variance components themselves should be taken into account when the variance components are examined. Large standard errors indicate instability of the estimated variance components; that is, the variance components are not robust, owing to the small sample of subjects. In particular, when the absolute value of a negative estimate is relatively large compared with its standard error, that variance component should not simply be treated as zero. Finally, G theory can be applied to estimate the individual-level standard error by acknowledging, as Brennan and his colleagues point out (Brennan & Lee, 1999; Lee, Brennan, & Kolen, 2000), that the standard error of measurement of raw scores or scale scores is conditional on the level of the measured construct; for example, some individuals respond to items less consistently than others when they frequently guess their responses. This adds complexity to estimating and interpreting the standard error of measurement.
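As a worked illustration using the tabled estimates, under the D study in the first column of the table above (n′r = 2, n′i = 5) the absolute standard error of measurement and an approximate 95% interval for an individual's universe score would be (in this fully nested design the relative error term is the same):

\[
\hat{\sigma}(\Delta) \;=\; \sqrt{\hat{\sigma}^{2}_{r:p}/n'_{r} + \hat{\sigma}^{2}_{i:r:p,e}/(n'_{i}\,n'_{r})}
\;=\; \sqrt{0.002 + 0.048} \;\approx\; 0.22,
\qquad X_{p} \pm 1.96 \times 0.22.
\]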
SEE ALSO: Construct Validity; Reliability

References

Brennan, R. L. (2000). (Mis)conceptions about generalizability theory. Educational Measurement: Issues and Practice, 19(1), 5–10.
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Brennan, R. L., Harris, D. J., & Hanson, B. A. (1987). The bootstrap and other procedures for examining the variability of estimated variance components in testing contexts (ACT Research Report Series 87-7). Iowa City, IA: American College Testing Program.
Brennan, R. L., & Lee, W.-C. (1999). Conditional scale-score standard errors of measurement under binomial and compound binomial assumptions. Educational and Psychological Measurement, 59, 5–24.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373–399.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
Gleser, G. C., Green, B. L., & Winget, C. N. (1978). Quantifying interview data on psychic impairment of disaster survivors. Journal of Nervous and Mental Disease, 166(3), 209–216.
Kane, M. T. (1982). A sampling model of validity. Applied Psychological Measurement, 6, 125–160.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Pacific Grove, CA: Brooks/Cole.
Lee, W.-C., Brennan, R. L., & Kolen, M. J. (2000). Estimators of conditional scale-score standard errors of measurement: A simulation study. Journal of Educational Measurement, 37, 1–28.
Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.
Searle, S. R., Casella, G., & McCulloch, C. M. (1992). Variance components. New York: Wiley.
Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. W. (1999). Note on sources of sampling variability in science performance assessments. Journal of Educational Measurement, 36, 61–71.
Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology, 34, 133–166.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Webb, N. M., Shavelson, R. J., & Haertel, E. H. (2006). Reliability coefficients and generalizability theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 81–124). Amsterdam: Elsevier.
Wiley, E. W. (2000). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Unpublished doctoral dissertation, Stanford University, CA.

Further Reading

Hoyt, W. T., & Melby, J. N. (1999). Dependability of measurement in counseling psychology: An introduction to generalizability theory. Counseling Psychologist, 27(3), 325–352.
Generalized Anxiety Disorder

Sandra J. Llera¹ and Michelle G. Newman²
¹Towson University, U.S.A. and ²Pennsylvania State University, U.S.A.

Generalized anxiety disorder (GAD) is characterized as a disorder of excessive worry and anxiety that is not targeted toward any one type of situation and is therefore generalized. Given that the constructs of worry and anxiety underlie many of the psychological disorders, GAD has been considered the basic anxiety disorder. In early nosological descriptions there was much overlap in symptomatology between GAD and other disorders; in fact, GAD has only recently been defined as an independent disorder in its own right. Since then, researchers have worked to further elucidate the nature, etiology, and mechanisms of GAD.

Definition

According to the current edition of the DSM (DSM-5; American Psychiatric Association, 2013), GAD is defined by its central criterion of excessive, chronic, and uncontrollable worry over a number of different topic areas. For instance, people with GAD may worry about their relationships with others, finances, work performance, health, and even daily chores. This pattern is distinct from the worry that characterizes other disorders. Those with panic disorder may worry about having another panic attack, those with an eating disorder may worry about their weight, and individuals with a specific phobia (e.g., of dogs or heights) may worry about encountering their feared object or situation. In contrast, those with GAD worry about a wide variety of topics, and the target of their worries may vary throughout the day.

Worry has been defined as a sequence of perseverative thoughts and images associated with negative affect and focused on
