0. Introduction
A test, questionnaire, observation sheet, coding system for think aloud data etc. may
be reliable without being valid. An instrument is 'reliable' simply if it works consistently,
giving the same results every time the same cases are remeasured. But the
measurement could still be consistently 'wrong' (biased). Two markers may agree on the
scores they give a student for fluency in an oral test, but they may both have
misunderstood the instructions on how fluency is to be scored. Checking on reliability is
basically detecting the amount of random error in one's measurement of variables;
checking on validity is trying to detect constant error.
An instrument or other data gathering and measuring procedure is valid only if it
measures what it is supposed to measure, but one can only ask about that if it is in the
first place reasonably reliable. There is no point checking up on what variable an
instrument measures or records unless it is consistently measuring something. So validity
checks should in theory follow reliability checks, only for those instruments that do have
satisfactory reliability. In practice, however, as we see below, one type of validity,
Content validity, is often established before reliability is checked.
It should be noted that, in practice, there can be as much argument in a research project
about what the appropriate variables are that should be quantified, and how to define
them properly, as about the precise validity issues of whether this or that instrument
measures the variables correctly.
The notion of validity can also be applied more widely to entire research projects. It then
includes consideration of all the things that make for good and bad sampling of cases
(external validity), and control of unwanted variables that may interfere with results,
design, and so forth (internal validity) as well as validity of the measuring instruments
themselves. However, here we focus just on the validity of the measuring and recording
instruments - fundamental both to research and to pedagogical uses of such means of
quantifying data, and a topic often more written about in the literature on testing than in
that on research methods.
Traditionally four 'types of validity' of instruments are often recognised: really four ways
of checking if an instrument does measure what it is supposed to.
Some add further types, especially these two, which I shall briefly describe but then leave aside:
'Face validity'. This is really only concerned with how the instrument appears to
subjects and users - i.e. whether to the ordinary person it looks as if it measures
what it is supposed to - not the essential matter of what it really measures.
Nevertheless in practical research it may be necessary to ensure face validity as
well as 'actual' validity. Especially in pedagogical contexts, teachers and learners
will expect a grammar test to look like a grammar test to them.
'Consequential validity'. This refers to actual and potential outcomes of use of an
instrument, rather than the instrument itself. It is a controversial concept, in that it is
arguable where and when the social and political ramifications of test use should
be addressed. I.e. an instrument might be said to have good
from those a set to try out as an instrument, or get learners to comment on your items in a
pilot study.
The only problem with using people who resemble future subjects, rather than 'experts',
as judges of validity is that if some complex theoretical concepts are involved in the
definition of the targeted variable, you cannot expect people who resemble subjects to
judge directly if the items relate to it, though they may still offer comments that help you
judge if the items are in fact relevant. Even what might seem like the everyday concept of
'anxiety' has behind it a (mainly psychological) literature which distinguishes for example
'trait anxiety' (which is a permanent feature of a person's personality) from 'situational
anxiety', and so on - distinctions which one cannot expect subjects to take into consideration. Other
variables may be even more arcane from the point of view of the ordinary subjects, and
even some applied linguists! (e.g. field dependence, integrative orientation,
metacognitive strategic competence, etc.).
Where relevant, the content of an instrument should pay due attention to/be compatible
with any relevant theory - whether linguistic, psycholinguistic or whatever. If your
instrument is a test designed to measure whether people have acquired the INFL
constituent in English or not, then you and the judges need a grasp of the relevant parts of
Chomskian grammar to see if indeed the items relate to all and only aspects of grammar
that belong in INFL. If you are using a set of categories to classify strategies you have
identified in think aloud data, then in establishing a suitable set of categories to use you
need to be aware of the relevant discussion in the literature of how to define strategies,
what classifications have been used before, etc. This is where the expert judgment comes
in.
Strictly there is a difference between what one should be judging as suitable content for
an instrument with absolute value as against one with relative interpretation. Take a test
of knowledge of English phrasal verbs. To check its content validity, first of course one
has to be clear how 'phrasal verb' is being defined: in the narrow linguist's sense, distinct
from prepositional verbs like look at, or in the broader teacher's sense, which includes
both. Then, if the test is supposed to record the absolute (criterion-referenced) construct
of how many phrasal verbs some subjects know, one needs to check that the test includes
a random sample from the entire set of phrasal verbs, as defined. If, however, the test is
intended only to measure a relative construct - who knows more phrasal verbs than who
(relative/norm-referenced measure) - it is sufficient to check that all the items really are
about phrasal verbs, with no other types included, and that a reasonable range of them has
been included: the sort of considerations discussed under reliability and item analysis for
norm-referenced tests would then be involved as well, and they would ensure that the
verbs included are in fact of moderate difficulty for the targeted subjects (excluding ones
that are too easy or too difficult). Often for research purposes the relative measure is most
suitable: if one needs to test subjects' reading ability for a study, it is usually in order to
distinguish 'good' from 'bad' readers purely in the sense of 'relatively better' and
'relatively worse' ones, so there would be no point in choosing a text for the test which
was too easy or too hard for them.
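As a small illustration of the sampling point above, here is a sketch (in Python) of drawing a random sample of items from a defined pool for a criterion-referenced test; the pool and the sample size are invented purely for the example.

import random

# Hypothetical pool: the full set of phrasal verbs as defined for the test.
phrasal_verb_pool = [
    "give up", "put off", "take in", "turn down", "look up",
    "carry out", "break down", "bring up", "set off", "work out",
    # ... the rest of the defined set would go here
]

random.seed(1)                                        # so the draw can be reproduced
test_items = random.sample(phrasal_verb_pool, k=5)    # sample size is arbitrary here
print(test_items)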
PhD students don't always get anyone other than themselves to check their instruments
for content. At least the supervisor should be involved as a second expert, especially as
you don't even have to pilot the instrument to do this! And if piloting an instrument,
always get as many open comments as possible from the subjects used about their
experiences using it, and what they thought it was measuring, as these may give valuable
hints as to whether some misunderstanding has occurred which might affect validity. In
particular you may discover that they are using test-taking or task-performing strategies
which mean that your instrument is measuring their successful use of these strategies as
much as their possession of whatever ability the test is testing (see McDonough, Strategy
and Skill in Learning a Foreign Language ch6). Or it may emerge that they are using
knowledge that you had not intended to be measured by this instrument: e.g. imagination
in a written task with which you intended merely to measure overall text organisation, or real-world
knowledge in test items you had intended to test just vocabulary knowledge.
Can you think of some examples of instruments where some variable other than
the intended linguistic one would often get measured?
There have been some more systematic attempts by researchers to investigate the validity
of various types of test by think aloud research. This compares the strategies test-takers
use when taking a test with the strategies they use when doing the 'same thing' otherwise.
E.g. do learners doing a conventional reading comprehension test where they read a
passage and answer multiple choice questions read the passage using the same strategies
that they use when reading a passage in a non-test situation? You can guess the answer
(see McDonough ch6 again). If the notion of the 'content' of an instrument is allowed to
include the skills used to tackle the items as well as what is in the items, as it well might,
this is all loosely part of content validation.
There is not much statistics involved in any of this. To demonstrate the content validity of an
instrument you are using, you can of course record the % of items in a test or inventory
that all judges agreed represented the targeted variable. Or, for any single item or the
instrument as a whole, one can record the % of judges who agreed it was valid. However,
more usually one uses the information at once to improve the instrument before using it
for real (and reports on this in one's write-up). Typically any item or instrument as a
whole that at least one judge objected to would be revised. In other words, this often
involves a form of 'item analysis'.
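A minimal sketch of how such judge-agreement percentages might be tallied, assuming an invented record of three judges' accept/reject verdicts on four items (names and figures are hypothetical):

# Invented verdicts: 1 = judge accepts the item as measuring the targeted
# variable, 0 = judge objects. Three judges, four items.
judgements = {
    "item1": [1, 1, 1],
    "item2": [1, 0, 1],
    "item3": [1, 1, 1],
    "item4": [0, 1, 1],
}

for item, votes in judgements.items():
    pct = 100 * sum(votes) / len(votes)
    print(f"{item}: {pct:.0f}% of judges accept -> {'keep' if pct == 100 else 'revise'}")

# % of items that every judge accepted
unanimous = [item for item, votes in judgements.items() if all(votes)]
print(f"{100 * len(unanimous) / len(judgements):.0f}% of items accepted by all judges")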
How does this differ from the sort of item analysis that often goes hand in hand
with the assessment of internal reliability of an instrument?
2. Concurrent validity
Checking up on concurrent validity means checking one's instrument by comparing it
with another one that you know is (more) valid (the 'criterion' measure or instrument). An
obvious limitation is that you cannot use this validation method on any instrument where
no known 'better' instrument exists to be compared with! In much language research,
especially perhaps psycholinguistics, one is measuring things that have not been
measured before. And if a better instrument does exist, in research work the suggestion
will always be there: why not use the known 'better' instrument in the first place, instead
of this other one? Maybe the reason will be cost - compare (3) and (7) in the Introduction.
Furthermore, there may be disagreement as to which is the instrument one can safely
assume is more valid. It is often assumed that 'direct' measures (more naturalistic
methods of measurement) are more valid than 'indirect' ones (e.g. artificial tests). More
elaborate extensions of this concept of validity form the multimethod part of
The columns are understood as labelled the same as the rows of the same number, so the
first figure in the top line (.241) is the correlation between the pertinence ratings of the
texts and the clarity ratings. This is in fact a correlation internally between components of
the instrument being validated, not directly throwing light on the concurrent validity of
the instrument as a whole. In such tables the diagonal is usually left empty because the
correlation of a measure with itself is +1, though sometimes this space is used to display
the standard deviation of scores on each variable. And normally only one triangle is
displayed, not the full square of correlations, because the other triangle would merely
repeat the same information over again (SPSS however gives you everything twice).
                                          2        3        4        5
1 Pertinence rating                     .241    -.019     .408     .459
2 Clarity rating                                  .733     .933     .858
3 Structural accuracy count                                .805     .723
4 Overall communicative instrument
  (1-3 combined)                                                    .953
5 Native speaker criterion rating
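For anyone wanting to produce this kind of display outside SPSS, the sketch below computes a correlation matrix and prints only one triangle; the variable names echo Fischer's components but the scores are randomly generated stand-ins.

import numpy as np
import pandas as pd

# Invented scores for 18 cases; only the layout of the output matters here.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(1, 7, size=(18, 3)),
                    columns=["pertinence", "clarity", "accuracy"])
data["overall"] = data.sum(axis=1)        # composite of the three components

corr = data.corr(method="spearman")       # Spearman, as in the worked example
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # keep only one triangle
print(corr.where(mask).round(3))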
not a teacher). Again the natural way of examining and quantifying the agreement is by a
method already seen for reliability - a contingency table display and the measure of
agreement suitable to data with absolute value - % agreement (or better Cohen's kappa
coefficient):
                           Teacher judgment
                           Masters       Nonmasters
CR Test     Masters          13
            Nonmasters
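A sketch of how % agreement and Cohen's kappa might be computed from such a 2x2 table; the cell counts used below are invented for illustration, not Black and Dockrell's figures.

# % agreement and Cohen's kappa for a 2x2 masters/non-masters table.
#                 teacher: master   teacher: non-master
table = [[40, 6],     # test: master
         [4, 30]]     # test: non-master

n = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / n                     # raw % agreement

row_totals = [sum(row) for row in table]
col_totals = [table[0][j] + table[1][j] for j in range(2)]
p_expected = sum(row_totals[i] * col_totals[i] for i in range(2)) / n ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"% agreement = {p_observed:.1%}, kappa = {kappa:.2f}")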
3. Predictive validity

This one was not illustrated in the Introduction, as it does not really apply to weighing
machines, and indeed not to many instruments used in language research. In fact one can
only check predictive validity for an instrument that claims to measure a variable that by
its nature has some forward-looking element to it, or can be assumed to
predict later performance on some other variable. Tests of language learning aptitude,
reading readiness and the like are the most obvious examples. They can then be checked
by seeing if people who do well on them also do well at a later time on whatever the
instrument claimed to predict (i.e. actually learning a language, reading etc.). The latter
(outcome) measure serves as a criterion to assess the former (predicting) measure, just as
the known more valid measure serves as criterion to assess the questionable one in
concurrent validation.
3.1 Predictive validity: variables on interval scales
Predictive validity is often quantified by a correlation measure of some sort, as the
commonly used instruments that it applies to are tests yielding interval scores. Thus
Curtin et al. (1983) report testing out Pimsleur's foreign language learning aptitude test at
University of Illinois High School. Correlations between the aptitude test scores and
grades obtained later at the end of the first year of learning a foreign language were not
very impressive:
French     .35
German     .305
Russian    .174
Latin      .449
Though nowhere near +1, at least the correlations are all positive - there were no
languages for which higher scores on the aptitude tests foretold lower learning
achievement. Those who did relatively well on the aptitude test were most likely to be
later relatively successful at learning Latin - by the means of instruction used in the above
institution, of course. As usual these correlations tell us nothing about any absolute levels
of achievement in the languages concerned. Further information might be gleaned from
an examination of the four scatterplots for the aptitude and specific language achievement
scores. For instance it might pay to see which cases were "spoiling" the lower
correlations and whether they had anything else in common.
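To make the computation concrete, here is a sketch of obtaining one such predictive-validity correlation and locating cases that 'spoil' it, using invented aptitude scores and end-of-year grades (not Curtin et al.'s data).

import numpy as np
from scipy.stats import pearsonr

# Invented data: aptitude scores at entry and grades at the end of the first year.
aptitude = np.array([23, 35, 41, 28, 50, 33, 45, 30, 38, 26])
grades   = np.array([55, 60, 72, 58, 66, 74, 80, 52, 65, 49])

r, p = pearsonr(aptitude, grades)
print(f"r = {r:.3f}, p = {p:.3f}")

# A crude look for 'spoiling' cases: large residuals from the best-fitting line.
slope, intercept = np.polyfit(aptitude, grades, 1)
residuals = grades - (slope * aptitude + intercept)
print("largest residual at case index", int(np.argmax(np.abs(residuals))))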
3.2 Predictive validity: variables on category scales
A criterion-referenced 'mastery' test which purports to predict, from attainment of a preset
accept or reject level on the test, success or failure on an ensuing course, would best be
analysed like the Black and Dockrell example discussed above, together with a 'false
positive analysis' etc. In practice the problem may be that you will get followup
information only on those who 'passed' the predictive measure, as only those were
allowed to go on to learn, so you will not be able to complete the contingency table. An
example where all the figures are available is afforded by Moller (1975), who examined
the predictive validity of the Davies English test, administered before embarkation to
foreign students coming to the UK, as an indicator of adequacy of their English for
postgraduate courses in a variety of subjects. (The Davies test was popular in the days
before the development of IELTS). The followup criterion here was UK supervisor rating.
                                Supervisor rating
                                Adequate       Inadequate
Pre-embarkation   Adequate        85%            10.5%
test              Inadequate       3.5%            1%
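A sketch of the 'false positive analysis' mentioned above, using invented follow-up counts for a predictive mastery test (the numbers and labels are purely hypothetical, not Moller's):

# Invented follow-up counts: rows = prediction from the test, columns = later outcome.
counts = {("pass predicted", "succeeded"): 40,
          ("pass predicted", "failed"):     6,   # false positives
          ("fail predicted", "succeeded"):  4,   # false negatives
          ("fail predicted", "failed"):    10}

n = sum(counts.values())
agreement = (counts[("pass predicted", "succeeded")] +
             counts[("fail predicted", "failed")]) / n
false_pos = counts[("pass predicted", "failed")] / n
false_neg = counts[("fail predicted", "succeeded")] / n
print(f"agreement {agreement:.0%}, false positives {false_pos:.0%}, false negatives {false_neg:.0%}")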
4. Construct validity

4.1 Construct validity within an instrument

We can illuminate the examination of construct validity within a complex instrument a bit
more by examining the Fischer example study again (see 2.1). Many instruments used in
language research are not complex in this way however. Columns 2-4 of the matrix in the
above section show the internal analysis in correlation terms of Fischer's instrument.
Examining these is effectively one way of checking 'construct validity' since the three
component measures were chosen, with some theoretical backing, as supposedly
quantifying different complementary variables which make up a meaningful single
construct. We therefore expect them not to agree much with each other but to agree with
the overall measure derived from them. Column 4 shows the correlations of each
component variable with Fischer's overall measure. Unlike all the other correlations,
which are "free" to arise or not depending on empirical fact, there is bound to be some
degree of correlation here because of the a priori connection between any overall measure
and its component variables. To confirm validity we expect pretty high correlations,
which we get for two out of the three components in fact: we can see again that the
pertinence rating correlates less well with the overall measure than the other two
component variables do, and the clarity rating the best. Thus the clarity rating would be
the best choice as a single indicator for the instrument as a whole, if you so wished.
Generally you would hope for reasonably high correlations between component variables
and a combined measure made from them, otherwise it might seem that the theoretical
framework that led you to think of them as components of "one" variable was wrong.
Columns 2 and 3 enable you to examine the correlations just between the three grassroots variables themselves. In a composite instrument such as this you would not expect
high correlations here otherwise it might be argued that there is no real point in
combining three separate measures - one would do. In fact two of the three are quite low,
one being very slightly negative. That means those who do well on the pertinence rating
score if anything relatively low on the structural accuracy measure. Indeed correlation-type coefficients can extend all the way down to -1, if relatively high values on one
variable are scored by cases who score low ones on the other and vice versa. This on the
whole supports the logic on which the measure was established, based on the theory that
the three variables are more or less independent contributors to the one composite
construct. If more contributory variables had been measured, the statistical technique of
'factor analysis' could have been used to sort out which were the essential ones,
quantifying definitely distinct components of the composite measure.
4.2 Construct validity of an instrument in relation to other, different, variables
This type of construct validation is really the most fundamental and most widely
available type of all, though often overlooked by researchers who limit themselves to
pursuing the 'easier' content validation. It can in the end involve any of the standard
research designs. A simple example of its use is where, as part of a study of the
acquisition of INFL in English, say, a researcher includes learners of several proficiency
levels, and some adult native speakers of English. All are tested with a test designed to
show mastery of this aspect of English grammar. The hypotheses naturally are that higher
proficiency levels will show greater mastery and NS the highest of all. However, from the
point of view of substantive research these hypotheses are of the type sometimes called
'straw men'. It seems somehow 'obvious' that native speakers will do well and that
learners will do less well, and that learners of higher proficiency will do better than those
with lower proficiency. Such hypotheses really cover things we know already and it
hardly seems worth doing a study just to show them to be true. Good hypotheses to
follow up in a research project are not usually ones that have already been proved
several times over, but ones more on the edge of knowledge. So one would hope
this researcher has some other hypotheses or research questions as well - ones which
assume a bit more and focus on something less well known (like perhaps ones about
which specific manifestations of INFL are acquired first, or whether they all emerge
together). But if a research project does include some 'straw man' hypotheses, this does
have a value as a form of construct validation. Precisely because the relationships are
already known, the new instrument to measure mastery of INFL can be checked for
validity by seeing if it yields the assumed relationships. This employs the 'argumentation
in reverse' that is typical of construct validation. In other words, when the data is
gathered, do we find an increase in mastery at higher proficiency levels, with near perfect
scores for native speakers? If we do, that suggests our INFL test is valid. If not, we would
wonder why and check the test (though if this was all done in the main study, not a pilot,
it is too late to change it!).
What statistics would probably be involved in that example?
To further characterise what goes on in construct validation I shall sketch some other
approaches that could have been used in the Fischer example above, though they were
not. Typically they would have concentrated on external relationships of Fischer's overall
instrument with measures of what are assumed to be clearly separate variables, neither
components of Fischer's nor criterion measures of much the "same" construct. The
criterion measure - the native speaker rating - could also be pursued in the same way. For
either instrument this could be done again either via correlations or, similarly, via research
involving comparisons of groups or conditions.
4.2.1 Construct validity: correlational design
The correlational approach would again generate a matrix of correlation coefficients, this
time between Fischer's instrument and several other different variables. Here you would
look to see if the correlations obtained accorded with what theory would predict the
relationship to be. For instance written communicative competence would not be
predicted to correlate particularly with explicit metalinguistic knowledge of linguistic
terminology, or with intelligence, so you would expect near zero coefficients. This is
sometimes called 'discriminant validation'. On the other hand you would assume a
positive relationship with oral communicative competence. Checking on this sort of
relationship is sometimes called 'convergent validation'.
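A sketch of what such a convergent/discriminant check might look like, with entirely invented scores for the instrument being validated, one variable expected to converge with it and one expected to be unrelated:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# Invented scores for 18 cases.
written_cc = rng.normal(size=18)                          # instrument being validated
oral_cc = written_cc + rng.normal(scale=0.7, size=18)     # should converge (related construct)
metalinguistic = rng.normal(size=18)                      # should diverge (unrelated construct)

for name, other in [("oral communicative competence", oral_cc),
                    ("metalinguistic knowledge", metalinguistic)]:
    rho, p = spearmanr(written_cc, other)
    print(f"{name}: rho = {rho:.2f} (p = {p:.3f})")
# Construct validity is supported if the first coefficient is clearly positive
# and the second is near zero.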
In a different realm, Fasold (1984 p120f) reports the convergent approach being used to
validate census reports on native language in Quebec. You would a priori feel able to
assume that the proportions of adults reporting themselves as mother tongue French or
English speakers would match closely the proportions of children independently recorded as
enrolling in Catholic and Protestant schools, and indeed this turns out to be so. So the
trustworthiness of the census is supported. The divergent approach is an essential
component in the MTMM method of validation (my book 21.2.3).
More elaborate studies along these lines use factor analysis, which can be seen as
analysing a whole set of correlations together rather than pairwise and identifying
which groups of variables are clubbing together to place cases in a similar order with
similar spacing of scores - i.e. which mutually correlate highly. In particular, 'confirmatory
factor analysis' is used where you have a prediction to test about what the correlations
should be, as you will have in a proper construct validation exercise.
4.2.2 Construct validity: non-experimental independent groups design
The group comparative approach was illustrated by the INFL example in 4.2. Fischer's
instrument could similarly be checked by trying it with native speakers as well as
learners. This would be permitted since it was not specifically norm-referenced to
learners only. Something similar was done for the TOEFL (Angoff and Sharon 1971). On
instruments such as these, if it emerged that a group of learners did better than a group of
native speakers, clearly this would be counter to assumptions and you would question the
validity of the measure. Here the EV - the categorisation of cases as 'native speaker' or
'learner' - is the "other" variable that the measure in question (used as DV) is checked for
its relationship with.
Additionally, for instruments made up of many items, Rasch analysis (see reliability) can
be used to show whether items are answered correctly more often at successively higher
proficiency levels or ages, as they should be.
4.2.3 Construct validity: experimental designs (independent groups and repeated
measures)
Instead of using existing characteristics of cases, such as being a native speaker or not,
often it is suitable to 'make' an EV in a fully experimental investigation of construct
validity. So you could use Fischer's instrument on two or more groups of cases after they
are exposed to different conditions or treatments which you feel confident on a priori
grounds will affect their written communicative competence. E.g. you teach one group
communicative writing but not the other. If Fischer's instrument duly records the assumed
difference, then that supports its validity. The term 'treatment validity' is sometimes used
for this specific approach.
The same 'treatment' approach is widely used in validating criterion-referenced
achievement tests in pedagogical contexts (Black and Dockrell 1984 p92). You
administer a questionably valid test of, say, apologising appropriately in English, to a
group who have been taught the content on which the test is supposedly based, and a
group who have not. The natural assumption here is of course that the former group
should score higher than the latter, if the test is valid. A variant of this approach would
rather involve administering the test twice to the same group, before and after instruction,
to see if the assumed improvement registers or not. Not dissimilar, also, is the following
procedure, often done quite informally. In a multi-item test or inventory you intersperse
'control' items on which you have a priori assumptions of what the responses should be.
For example, if cases are rating the 'idiomaticity' of some phrases you may include some
that are clearly non-idiomatic as a check on whether the cases are using the same
definition as you.
4.2.4 Conclusion on construct validation statistics
At this point it is worth drawing attention to some differences between the use of
correlation coefficients to quantify validity and their use in connection with reliability.
First, we have just seen that near zero and negative values may well arise in validity
work: these would be most unexpected in a reliability study (or indeed concurrent or
predictive validity study). Second, correlation matrices are most often seen in a validity
study. In reliability work where you have, say, a set of three or more repetitions of a test,
you would probably try to quantify the overall agreement between them rather than
examine the agreement between each possible pair of occasions when the test was
redone. Finally, in the study of validity you are studying the relationships between
different variables and so often comparing instruments yielding scores on different scales
of the same general type. For instance, Fischer's instrument has a maximum score of 12,
the criterion rating one of 24. That again does not arise in reliability checking as by
definition the same measure on the same scale is repeated in some way. This causes no
problem for correlation-type coefficients, incidentally, since they only quantify agreement
in a relative sense and are largely unaffected by differences in score level on the two
variables compared.
Further, something I have generally glossed over above but which must be considered when
interpreting the correlation coefficients is the number of cases involved. To oversimplify,
what constitutes a correlation that is big enough in a positive or negative direction to be
taken as marking any convincing relationship (versus no marked relationship at all)
depends on the number of cases. The threshold values can be found on significance tables
in basic statistics texts. The threshold value for Spearman rho for 18 cases is in fact .399,
so we see that two of the correlations obtained for Fischer's instrument are not high
enough to indicate any real relationship at all (using the customary .05 level of
significance).
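In practice the p value is usually obtained directly from software rather than from printed tables; a sketch with invented scores for 18 cases:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
x = rng.normal(size=18)               # invented scores for 18 cases
y = 0.3 * x + rng.normal(size=18)     # invented scores on a second variable

rho, p = spearmanr(x, y)
# With n = 18 the two-tailed .05 threshold for rho is about .40, so a smaller
# coefficient (equivalently, p > .05) would not count as a convincing relationship.
print(f"rho = {rho:.3f}, p = {p:.3f}, significant at .05: {p < 0.05}")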
In a similar way, the key statistical information that emerges from the group comparison
and experimental methods is an indication of probability, signified by 'p', and perhaps
other measures of amount of relationship like eta squared (rather than a correlation
coefficient). These can be arrived at by any number of statistical means appropriate to the
particular design used. As customary, if p is less than 0.05, researchers would generally
assume that a difference in DV scores worth paying attention to has been demonstrated
between the groups or conditions. Technically the difference is said to be 'significant'. See
further a text such as Langley (1979) or Rowntree (1981) for a simple account of the
reasoning behind significance tests and p values and their interpretation. In the present
application, then, a p near zero would indicate a relationship between the EV variable and
the one whose quantification is in question. You would then have to look further at the scores
to see if the relationship was in the assumed direction - e.g. that native speakers did do
significantly better rather than worse than learners on Fischer's measure, and so support
its validity.
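As an illustration of the group-comparison route, here is a sketch comparing invented scores for learners and native speakers on an instrument being validated; an independent-groups t test is used, though it is only one of several defensible choices.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)

# Invented DV scores on the instrument being validated.
learners = rng.normal(loc=7.0, scale=2.0, size=20)
native_speakers = rng.normal(loc=10.5, scale=1.0, size=15)

t, p = ttest_ind(native_speakers, learners, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")

# The p value alone does not give the direction: check the means as well.
if p < 0.05 and native_speakers.mean() > learners.mean():
    print("Native speakers score significantly higher, as assumed -> supports validity")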
All these approaches involving validation against measures of clearly different variables
share the characteristic of being like normal research designs, but done "back to front".
Indeed any of the many designs of investigations that are used in research can be used in
this back-to-front way as validity checks. To reiterate: what I mean by "back to front" is
as follows. Normally in language research you assume that the variables that are of
interest can be quantified validly and seek through correlation study or experiment etc. to
find out something about relationships between them. In the present situation you rather
have to assume you know the relationship between the variables, and that measures of
variables other than the one whose validity is in question are valid, in order to find out
whether the instrument in question is measuring what it is supposed to.
If you do use in this way several methods of measuring the 'same thing' (or at least of
what you intend to be the same construct) of course it looks a bit like concurrent
validation, but is actually quite different. You might for example measure reading ability
both with a cloze test (filling gaps in a text) and a multiple choice test of comprehension
after reading a passage, neither of which you are sure is valid. You can
even compare the results (e.g. with correlation) to see how far they agree, in addition to
combining the scores to get 'the best measure'. But this is not concurrent validation unless
you are able to make the assumption that one of the instruments is definitely valid. If the
two measures agree, it is encouraging but does not actually confirm their validity if you
know the validity of neither in any other way. There could be a constant error shared by
both, just as in reliability analysis. And if there is a big difference in the results, you don't
know if it is because one is valid and the other not, or because both are invalid in
different ways, or what. Gathering qualitative data in the form of comments from the
subjects afterwards may help decide. But there is always the risk that you create a set of
scores with less validity by combining the results of a highly valid instrument with those
of an invalid one.
Probably the best way to decide which multiple ways of measuring the same thing to
use is to choose ones which seem likely to involve rather different sources of
invalidity, in the hope that these balance out. For instance, to get a balanced overall measure of
vocabulary knowledge you would combine tests with written, oral and picture
stimuli, so that it is not invalidated by being partly a reading test, which could be the case
if you used only written stimuli. You need an awareness of sources of invalidity (see my
book esp. ch 21-23 and accounts of validity generally).
Investigation of validity (and indeed of reliability) can be a research topic in itself rather
than, as we have pursued it here, just something you do on the side of your 'real' research,
to check that the instruments are accurate. A particular kind of validity checking done as a
research project in itself is the 'multitrait multimethod' study (MTMM). This sets out to
systematically measure more than one construct ('trait') each by more than one type of
instrument ('method') in a parallel way. One then gets to see whether the different
instruments ('methods') produce more drastically different results than the different
supposed constructs ('traits') do (Bachman; for a clear example of its use see Gardner,
Lalonde et al., The role of attitude and motivation, Language Learning 35).
When combining the results of different instruments, you need a bit of statistical
know-how. We consider here interval scores with relative, not absolute, value. Just adding
the scores from three different reading tests will not be satisfactory if they are on different
scales, as the weight or contribution of each instrument will not be equal. Even if they are
all on the same interval scale, the three sets of figures may have different standard
deviations, which has an effect. If they are all at least interval, one method is to turn all
the scores into standard scores (also known as z scores), which puts them all on scales of
the same length with the same mean and SD. In SPSS this can be done by going to
Analyze...Descriptive statistics...Descriptives, and marking the box for SPSS to Save
standardized values as variables. SPSS creates columns with the standard score
equivalents of the columns you had before. In fact the standard score (or z score) equivalent
of a score = (the original score - the mean) / the SD. This reduces any interval
scale to a scale with scores of mean 0 and SD 1, ranging approximately between -3 and
+3. You can then safely add or average the three sets of scores to produce a composite
scale. For non-interval scales or mixtures of scale types other techniques are needed (On
combining scores see further my book chs 15-18).
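Outside SPSS the same standardisation can be done directly from the formula just given; a sketch with invented scores on three reading tests of different lengths:

import numpy as np

# Invented raw scores for the same six cases on three reading tests with different scales.
test_a = np.array([12, 15, 9, 18, 11, 14])     # out of 20
test_b = np.array([55, 70, 40, 85, 50, 65])    # out of 100
test_c = np.array([3, 4, 2, 5, 3, 4])          # out of 5

def z_scores(x):
    # standard score = (original score - mean) / SD, as in the formula above
    return (x - x.mean()) / x.std(ddof=1)

composite = (z_scores(test_a) + z_scores(test_b) + z_scores(test_c)) / 3
print(np.round(composite, 2))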
A more sophisticated way of combining interval scores would be to use factor analysis to
pick out the most marked information shared between the scores for the three instruments
and use scores for each subject on that 'first common factor' as the overall measure (see
Factor Analysis). Obviously there is a good chance that the information shared by three
partly invalid instruments designed to measure the same thing will be more valid than the
information that differentiates them.
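A sketch of how the first-common-factor approach might be carried out, here using scikit-learn's FactorAnalysis purely as one convenient routine; the three sets of scores are invented.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)

# Invented scores: three partly overlapping measures of the 'same' ability for 30 cases.
ability = rng.normal(size=30)
scores = np.column_stack([ability + rng.normal(scale=0.5, size=30),
                          ability + rng.normal(scale=0.7, size=30),
                          ability + rng.normal(scale=0.9, size=30)])

fa = FactorAnalysis(n_components=1)
first_factor = fa.fit_transform(scores)[:, 0]   # each case's score on the first common factor
print(np.round(first_factor[:5], 2))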