
Classical Test Theory

Measurement theory developed from CTT, which originated in the early 1900s with Spearman's (1904) work on the measurement of individual differences in mental abilities. Since then, CTT has been widely used in many research areas. Two features, validity and reliability, are the main test indices in CTT. There are several types of validity: content validity, criterion-related validity, and construct validity (Allen & Yen, 2002). Content validity can be established through face validity and logical validity. Content validity is a qualitative type of validity in which the domain of the concept is made clear and the analyst judges whether the measures fully represent the domain (Bollen, 1989, p. 185). Criterion-related validity, which includes predictive validity and concurrent validity, is usually measured by the degree of relation between the test and external criteria. Construct validity establishes a theoretical construct through psychometric methods such as exploratory and confirmatory factor analysis and the Multitrait-Multimethod (MTMM) approach, which assesses the degree of convergent and discriminant validation (Campbell & Fiske, 1959; Allen & Yen, 2002).
There are many different methods for defining and estimating reliability. Within CTT, reliability is defined as the proportion of observed score variance that is attributable to true score variance. Test-retest, parallel-forms, alternate-forms, and internal-consistency methods (including the Spearman-Brown formula and Cronbach's alpha) are the common ways to estimate reliability (Allen & Yen, 2002). Test-retest reliability refers to the strength of the correlation between two sets of results obtained when the same examinees take the same test at different times. The test-retest reliability estimate is affected by serious problems such as the carry-over effect and the length of the time interval between the two administrations (Allen & Yen, 2002). Similar to test-retest reliability, parallel-forms or alternate-forms reliability is measured by the correlation between two parallel tests or two tau-equivalent tests. The criterion for parallel tests is that the true scores and error variances of the two tests be identical, which is almost impossible to meet in practice. Tau-equivalent tests require only that the true score of one test be a linear function of the true score of the other; the error variances need not be identical. Internal-consistency estimates assess reliability using two halves of the items from the same test, and they have the advantage of avoiding the problems caused by repeated testing under the test-retest and parallel-forms procedures. When the two subtests are parallel, the Spearman-Brown formula can be used to estimate internal consistency, whereas Cronbach's alpha applies when the two subtests are tau-equivalent.
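
As a concrete illustration of the internal-consistency estimates described above, the following sketch computes Cronbach's alpha and a Spearman-Brown stepped-up split-half coefficient. It is a minimal Python/NumPy example; the five-person, four-item score matrix is invented for illustration and does not come from any source cited here.

```python
import numpy as np

def cronbachs_alpha(item_scores):
    """Cronbach's alpha for an examinees-by-items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(split_half_r):
    """Spearman-Brown correction for a split-half correlation."""
    return 2 * split_half_r / (1 + split_half_r)

# Hypothetical 5-person x 4-item data, scored 0/1
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(cronbachs_alpha(scores))

# Split-half: correlate the odd- and even-item subtotals, then step up
odd, even = scores[:, ::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
print(spearman_brown(r_half))
```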
Validity and reliability estimation are established within the framework of CTT. The fundamental feature of CTT is the formulation of the observed score (X) as a composite of two independent components, an underlying true-score component (T) and measurement error (E):

Xip = Tip + Eip

In this framework, the true score (T) for item i and person p is defined as the expected value of the individual's observed scores over repeated assessments with the same instrument under identical conditions. There are several assumptions in CTT (Allen & Yen, 2002). First, the expected value of the observed scores is the true score: the expected value of the error scores in the population is zero, and the error scores are normally distributed. Second, there is no correlation between true scores and error scores. Third, the error scores from two different tests are uncorrelated. Fourth, there is no correlation between the true score from test 1 and the error score from test 2 in the population. The fifth assumption is the existence of parallel tests, which requires that the two tests have the same true scores (T1 = T2) and identical error score variances (σ²(E1) = σ²(E2)). The last assumption of CTT is the existence of tau-equivalent tests, which requires T1 = T2 + C, where C is a constant; the equal error variance condition does not apply to tau-equivalent tests.
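
The variance decomposition implied by this model can be verified with a short simulation. The sketch below uses hypothetical parameter values (a true score SD of 10 and an error SD of 5, chosen only for the example) to show that reliability, defined above as the ratio of true score variance to observed score variance, also equals the correlation between two parallel forms:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
true_scores = rng.normal(50, 10, n)   # T
errors = rng.normal(0, 5, n)          # E: mean zero, independent of T
observed = true_scores + errors       # X = T + E

# Reliability: proportion of observed score variance due to true scores
reliability = true_scores.var() / observed.var()
print(reliability)  # approx 10**2 / (10**2 + 5**2) = 0.8

# Equivalently, the correlation between two parallel forms (same T,
# independent errors with equal variance) estimates the same quantity
parallel = true_scores + rng.normal(0, 5, n)
print(np.corrcoef(observed, parallel)[0, 1])  # also approx 0.8
```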
There are several item indices in CTT, including the p-value, the d-value, and the item-test correlation (Allen & Yen, 2002). The p-value, known as the item difficulty, is computed as the proportion of examinees in the total sample who answer item i correctly, and it ranges from 0 to 1. The d-value (item discrimination) is estimated as:

di = Ui/niU − Li/niL

where Ui/niU is the proportion of examinees in the upper group who answer item i correctly, and Li/niL is the corresponding proportion for the lower group. The upper and lower groups are usually the same or similar in size, each typically comprising 10% to 33% of the total sample.
The item-test correlation, an alternative index of item discrimination, is the point-biserial correlation between each item and the total test score (Allen & Yen, 2002, p. 38):

rpb = ((μ+ − μX) / σX) √(pi / (1 − pi))

where μ+ is the mean test score of the examinees who answer item i correctly, μX and σX are the mean and standard deviation of the test scores, and pi is the item difficulty.
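
A minimal sketch of these three item indices, assuming a 0/1 item-response matrix and the common 27% rule for forming the upper and lower groups (the function name, group fraction, and data layout are choices made for this example):

```python
import numpy as np

def item_indices(responses, upper_frac=0.27):
    """p-value, d-value, and point-biserial for a 0/1 response matrix.

    Assumes every item has 0 < p < 1 so the formulas are defined.
    """
    responses = np.asarray(responses)
    total = responses.sum(axis=1)            # total test scores
    n, k = responses.shape
    g = max(1, int(round(upper_frac * n)))   # group size (27% rule)
    order = np.argsort(total)
    lower, upper = order[:g], order[-g:]

    p = responses.mean(axis=0)               # item difficulty
    d = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

    # Point-biserial: ((mean total of correct) - (grand mean)) / sd * sqrt(p/(1-p))
    r_pb = np.empty(k)
    for i in range(k):
        mu_plus = total[responses[:, i] == 1].mean()
        r_pb[i] = (mu_plus - total.mean()) / total.std() * np.sqrt(p[i] / (1 - p[i]))
    return p, d, r_pb
```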

Although CTT provides several item-level indices, its main focus lies in test-level indices such as reliability and validity. Theoretical and practical shortcomings in interpreting item and test indices under CTT have arisen in psychological and educational measurement (Hambleton & van der Linden, 1982). According to Hambleton and van der Linden, the CTT true score is test dependent: because it is defined as the expected value of observed scores, it functions as both a person parameter and a test parameter, so it depends on the person and on the particular test. In addition, item parameters such as the p-value, d-value, and item-test correlation, and test parameters such as reliability and validity, depend on the characteristics of the sample. A final shortcoming concerns the observed test score and measurement error: for instance, the reliability coefficient is built from the observed score variance and the error score variance under the assumption that parallel tests exist, an assumption that is almost impossible to meet in practice.
Unlike CTT, which suffers from serious theoretical and practical problems with its test-dependent person parameter, sample-dependent item parameters, and parallel-test assumption, modern measurement theory, known as IRT and developed by Lord (1952) and Birnbaum (1968), offers many important advantages. Embretson and Reise (2000) summarized the benefits of using IRT rather than CTT (Table 2.1, p. 15):

The old rules (CTT):
Rule 1. The standard error of measurement applies to all scores in a particular population.
Rule 2. Longer tests are more reliable than shorter tests.
Rule 3. Comparing test scores across multiple forms is optimal when the forms are parallel.
Rule 4. Unbiased estimates of item properties depend on having representative samples.
Rule 5. Test scores obtain meaning by comparing their position in a norm group.
Rule 6. Interval scale properties are achieved by obtaining normal score distributions.
Rule 7. Mixed item formats lead to unbalanced impact on test total scores.
Rule 8. Change scores cannot be meaningfully compared when initial score levels differ.
Rule 9. Factor analysis on binary items produces artifacts rather than factors.
Rule 10. Item stimulus features are unimportant compared to psychometric properties.

The new rules (IRT):
Rule 1. The standard error of measurement differs across scores (or response patterns), but generalizes across populations.
Rule 2. Shorter tests can be more reliable than longer tests.
Rule 3. Comparing test scores across multiple forms is optimal when test difficulty levels vary between persons.
Rule 4. Unbiased estimates of item properties may be obtained from unrepresentative samples.
Rule 5. Test scores have meaning when they are compared for distance from items.
Rule 6. Interval scale properties are achieved by applying justifiable measurement models.
Rule 7. Mixed item formats can yield optimal test scores.
Rule 8. Change scores can be meaningfully compared when initial score levels differ.
Rule 9. Factor analysis on raw item data yields a full information factor analysis.
Rule 10. Item stimulus features can be directly related to psychometric properties.

Traditional IRT models share the unidimensionality assumption that a single underlying latent construct accounts for the observed responses to each of the test's items. A second assumption, local independence, holds that once the latent construct is conditioned on, an examinee's responses to different items are statistically independent; that is, the latent trait fully explains any association among the items. The 1PLM, or Rasch model (e.g., Wright, 1977), is one of the simplest IRT models. Only a single item parameter, the item difficulty b, is required to represent the item response process; b is defined as the point on the latent trait scale at which an examinee has a 50% probability of answering the item correctly. For the hypothetical item presented in Figure 1, the b parameter equals 0.0. In its logistic form, the 1PLM gives the probability of a correct response as

P(yij = 1 | θj) = exp(θj − bi) / [1 + exp(θj − bi)],

and in the normal-ogive form used throughout this section,

P(yij = 1 | θj) = Φ(θj − bi),

where Φ denotes the standard normal cumulative distribution function.

Figure 1. Item characteristic curve for a hypothetical test item.


Note. The horizontal axis represents the underlying construct to be measured, and the vertical axis represents the probability of a correct response.
All items under the 1PLM share identically shaped item characteristic curves (ICCs) that differ only in the value of the location parameter b. The 2PLM adds the item discrimination parameter a, which allows the ICCs of different items to exhibit different slopes. Items with higher discrimination parameter values provide more information about an examinee's ability at a specific location in the ability distribution than items with lower values. The normal-ogive form of the 2PLM is given by

P(yij = 1 | θj) = Φ(ai(θj − bi)).

The 3PLM adds a lower asymptote to the ICC, the pseudo-guessing parameter ci, which represents the expected proportion of correct responses from individuals with very low ability:

P(yij = 1 | θj) = ci + (1 − ci) Φ(ai(θj − bi)).
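
Because the 3PLM reduces to the 2PLM when ci = 0, and further to the 1PLM when ai = 1, a single function can sketch all three ICCs. The following minimal example uses SciPy's normal CDF for the normal-ogive form given above; all parameter values are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Normal-ogive ICC; a=1, c=0 gives the 1PLM, and c=0 alone the 2PLM."""
    return c + (1 - c) * norm.cdf(a * (theta - b))

theta = np.linspace(-3, 3, 7)
print(icc_3pl(theta, b=0.0))                # 1PLM: P = .5 at theta = b = 0
print(icc_3pl(theta, a=1.7, b=0.0))         # 2PLM: steeper slope
print(icc_3pl(theta, a=1.7, b=0.0, c=0.2))  # 3PLM: lower asymptote of .2
```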

A central task for psychometricians is to estimate both item and person parameters, using estimation methods that include maximum likelihood and various Bayesian approaches. The fundamental principle of maximum likelihood estimation is to find the parameter values that maximize the likelihood function of the observed responses; for a person j responding to items i = 1, ..., I, the likelihood of ability θj is

L(θj) = ∏i P(yij = 1 | θj)^yij [1 − P(yij = 1 | θj)]^(1 − yij).

Bayesian methods combine this likelihood with a prior distribution to form a posterior distribution via Bayes' theorem: the maximum a posteriori (MAP) estimate is the mode of the posterior, while the expected a posteriori (EAP) estimate is its mean. When the posterior cannot be computed analytically, Markov chain Monte Carlo (MCMC) methods such as the Gibbs sampler and the Metropolis-Hastings algorithm draw successive samples from the joint posterior distribution of all item and person parameters, and estimates are obtained from the sampled chains after convergence.
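
A minimal sketch of these ideas, assuming known item difficulties under the normal-ogive 1PLM and evaluating the likelihood on a grid (a simple stand-in for the numerical optimization and quadrature used in practice); the item parameters and response pattern are invented for the example:

```python
import numpy as np
from scipy.stats import norm

def likelihood(theta, y, b):
    """L(theta) = prod_i P_i^y_i * (1 - P_i)^(1 - y_i) under the 1PLM."""
    p = norm.cdf(theta - b)
    return np.prod(p**y * (1 - p)**(1 - y))

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # known item difficulties
y = np.array([1, 1, 1, 0, 0])              # one examinee's responses

grid = np.linspace(-4, 4, 401)
L = np.array([likelihood(t, y, b) for t in grid])

theta_ml = grid[L.argmax()]                   # maximum likelihood estimate
prior = norm.pdf(grid)                        # standard normal prior
post = L * prior                              # unnormalized posterior
theta_eap = (grid * post).sum() / post.sum()  # EAP: posterior mean
theta_map = grid[post.argmax()]               # MAP: posterior mode
print(theta_ml, theta_eap, theta_map)
```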

Testlet Response Theory (TRT)


One of the two assumptions of unidimensional IRT is local independence of items. However, in the area of reading, a typical reading comprehension test contains several passages, each followed by multiple items related to that passage. Items attached to the same passage are correlated with one another, a clear violation of the local independence assumption of unidimensional IRT. An item bundle is a small group of multiple-choice items that share a common reading passage or graph, or a small group of matching items. Although bundled items may violate the local independence assumption, unidimensional Rasch models and two- or three-parameter IRT models, which estimate person ability and item difficulty, and additionally item discrimination and a guessing parameter, have often been applied to such items while ignoring the violation of the assumption (de Ayala, 2009; Yen & Fitzpatrick, 2006).
Item bundles (Rosenbaum, 1988; Wilson & Adams, 1995), context-dependent item sets (Haladyna, 1992; Keller, Swaminathan, & Sireci, 2003), and testlets (Wainer & Kiely, 1987) allow for the measurement of interrelated tasks and skills. Many researchers have reported that ignoring local dependence (LD) causes several problems: (1) overestimation of the precision of person ability estimates; (2) underestimation of the standard errors of parameter estimates; and (3) biased item parameter estimates, in particular biased item difficulties and item discriminations (e.g., Chen & Thissen, 1997; Chen & Wang, 2007; Sireci, Thissen, & Wainer, 1991; Tuerlinckx & De Boeck, 2001; Wainer & Wang, 2000; Wang & Wilson, 2005b; Yen, 1984, 1993; Zhang, Shen, & Cannady, 2010). Thissen et al. (1989) and Sireci et al. (1991) pointed out that violating the local independence assumption leads to an overestimate of reliability and test information, as well as an underestimate of the standard error of measurement.
To account for the LD associated with items nested within a testlet, Bradlow, Wainer, and Wang (1999; see also Wainer et al., 2007) proposed the 2PL testlet model. The 2PL TRT model includes a random-effects parameter γjd(i) representing the interaction of person j with testlet d(i), the testlet that contains item i. In this model, the probability that person j with ability θj responds correctly to item i nested in testlet d(i) is given by

P(yij = 1 | θj) = Φ(ai(θj − bi − γjd(i))).
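
To illustrate the model, the following sketch simulates responses under the 2PL testlet model. All parameter values (numbers of persons, items, and testlets, and the testlet-effect standard deviation) are hypothetical choices for the example, not values from Bradlow et al.:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

n_persons, n_items, n_testlets = 1000, 12, 3
testlet_of = np.repeat(np.arange(n_testlets), n_items // n_testlets)  # d(i)

a = rng.uniform(0.8, 1.6, n_items)        # item discriminations
b = rng.normal(0.0, 1.0, n_items)         # item difficulties
theta = rng.normal(0.0, 1.0, n_persons)   # person abilities
gamma = rng.normal(0.0, 0.5, (n_persons, n_testlets))  # person-by-testlet effects

# P(y_ij = 1) = Phi(a_i * (theta_j - b_i - gamma_{j,d(i)}))
p = norm.cdf(a * (theta[:, None] - b - gamma[:, testlet_of]))
y = rng.binomial(1, p)  # simulated 0/1 response matrix
```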

References

Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Waveland Press.

Bollen, K. A. (1989). Structural equations with latent variables (pp. 179-225). John Wiley & Sons.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
