0. Introduction
A test, questionnaire, observation sheet, coding system for think aloud data etc. may
be reliable without being valid. An instrument is 'reliable' simply if it works consistently,
giving the same results every time the same cases are remeasured. But the
measurement could still be consistently 'wrong' (biased). Two markers may agree on the
scores they give a student for fluency in an oral test, but they may both have
misunderstood the instructions on how fluency is to be scored. Checking on reliability is
basically detecting the amount of random error in one's measurement of variables;
checking on validity is trying to detect constant error.
An instrument or other data gathering and measuring procedure is valid only if it
measures what it is supposed to measure, but one can only ask about that if it is in the
first place reasonably reliable. There is no point checking up on what variable an
instrument measures or records unless it is consistently measuring something. So validity
checks should in theory follow reliability checks, only for those instruments that do have
satisfactory reliability. In practice, however, as we see below, one type of validity,
Content validity, is often established before reliability is checked.
It should be noted that, in practice, there can be as much argument in a research project
about what the appropriate variables are that should be quantified, and how to define
them properly, as about the precise validity issues of whether this or that instrument
measures the variables correctly.
The notion of validity can also be applied more widely to entire research projects. It then
includes consideration of all the things that make for good and bad sampling of cases
(external validity), and control of unwanted variables that may interfere with results,
design, and so forth (internal validity) as well as validity of the measuring instruments
themselves. However, here we focus just on the validity of the measuring and recording
instruments - fundamental both to research and to pedagogical uses of such means of
quantifying data, and a topic often more written about in the literature on testing than in
that on research methods.
Traditionally four 'types of validity' of instruments are often recognised: really four ways
of checking if an instrument does measure what it is supposed to.
Some add further types, especially these two, which I shall briefly describe but then leave aside:
'Face validity'. This is really only concerned with how the instrument appears to
subjects and users - i.e. whether to the ordinary person it looks as if it measures
what it is supposed to - not the essential matter of what it really measures.
Nevertheless in practical research it may be necessary to ensure face validity as
well as 'actual' validity. Especially in pedagogical contexts, teachers and learners
will expect a grammar test to look like a grammar test to them.
'Consequential validity'. This refers to actual and potential outcomes of use of an
instrument, rather than the instrument itself. It is a controversial concept, in that it is
arguable where and when the social and political ramifications of test use should
be addressed. I.e. an instrument might be said to have good
from those a set to try out as an instrument, or get learners to comment on your items in a
pilot study.
The only problem with using people who resemble future subjects, rather than 'experts',
as judges of validity is that if some complex theoretical concepts are involved in the
definition of the targeted variable, you cannot expect people who resemble subjects to
judge directly if the items relate to it, though they may still offer comments that help you
judge if the items are in fact relevant. Even what might seem like the everyday concept of
'anxiety' has behind it a (mainly psychological) literature which distinguishes for example
'trait anxiety' (which is a permanent feature of a person's personality) from 'situational
anxiety', and so on - distinctions which one cannot expect subjects to take into consideration. Other
variables may be even more arcane from the point of view of the ordinary subjects, and
even some applied linguists! (e.g. field dependence, integrative orientation,
metacognitive strategic competence, etc.).
Where relevant, the content of an instrument should pay due attention to/be compatible
with any relevant theory - whether linguistic, psycholinguistic or whatever. If your
instrument is a test designed to measure whether people have acquired the INFL
constituent in English or not, then you and the judges need a grasp of the relevant parts of
Chomskian grammar to see if indeed the items relate to all and only aspects of grammar
that belong in INFL. If you are using a set of categories to classify strategies you have
identified in think aloud data, then in establishing a suitable set of categories to use you
need to be aware of the relevant discussion in the literature of how to define strategies,
what classifications have been used before, etc. This is where the expert judgment comes
in.
Strictly there is a difference between what one should be judging as suitable content for
an instrument with absolute value as against one with relative interpretation. Take a test
of knowledge of English phrasal verbs. To check its content validity, first of course one
has to be clear how 'phrasal verb' is being defined: in the narrow linguist's sense, distinct
from prepositional verbs like look at, or in the broader teacher's sense, which includes
both. Then, if the test is supposed to record the absolute (criterion-referenced) construct
of how many phrasal verbs some subjects know, one needs to check that the test includes
a random sample from the entire set of phrasal verbs, as defined. If, however, the test is
intended only to measure a relative construct - who knows more phrasal verbs than who
(relative/norm-referenced measure) - it is sufficient to check that all the items really are
about phrasal verbs, with no other types included, and that a reasonable range of them has
been included: the sort of considerations discussed under reliability and item analysis for
norm-referenced tests would then be involved as well, and they would ensure that the
verbs included are in fact of moderate difficulty for the targeted subjects (excluding ones
that are too easy or too difficult). Often for research purposes the relative measure is most
suitable: if one needs to test subjects' reading ability for a study, it is usually in order to
distinguish 'good' from 'bad' readers purely in the sense of 'relatively better' and
'relatively worse' ones, so there would be no point in choosing a text for the test which
was too easy or too hard for them.
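As a small illustration of the sampling point above, here is a sketch (in Python) of drawing a random sample of items from a defined pool for a criterion-referenced test; the pool and the sample size are invented purely for the example.

import random

# Hypothetical pool: the full set of phrasal verbs as defined for the test.
phrasal_verb_pool = [
    "give up", "put off", "take in", "turn down", "look up",
    "carry out", "break down", "bring up", "set off", "work out",
    # ... the rest of the defined set would go here
]

random.seed(1)                                        # so the draw can be reproduced
test_items = random.sample(phrasal_verb_pool, k=5)    # sample size is arbitrary here
print(test_items)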
PhD students don't always get anyone other than themselves to check their instruments
for content. At least the supervisor should be involved as a second expert, especially as
you don't even have to pilot the instrument to do this! And if piloting an instrument,
always get as many open comments as possible from the subjects used about their
experiences using it, and what they thought it was measuring, as these may give valuable
hints as to whether some misunderstanding has occurred which might affect validity. In
particular you may discover that they are using test-taking or task-performing strategies
which mean that your instrument is measuring their successful use of these strategies as
much as their possession of whatever ability the test is testing (see McDonough, Strategy
and Skill in Learning a Foreign Language ch6). Or it may emerge that they are using
knowledge that you had not intended to be measured by this instrument: e.g. imagination
in a written task with which you intended merely to measure overall text organisation, or real-world
knowledge in test items you had intended to test just vocabulary knowledge.
Can you think of some examples of instruments where some variable other than
the intended linguistic one would often get measured?
There have been some more systematic attempts by researchers to investigate the validity
of various types of test by think aloud research. This compares the strategies test-takers
use when taking a test with the strategies they use when doing the 'same thing' otherwise.
E.g. do learners doing a conventional reading comprehension test where they read a
passage and answer multiple choice questions read the passage using the same strategies
that they use when reading a passage in a non-test situation? You can guess the answer
(see McDonough ch6 again). If the notion of the 'content' of an instrument is allowed to
include the skills used to tackle the items as well as what is in the items, as it well might,
this is all loosely part of content validation.
There is not much statistics involved in any of this. To demonstrate the content validity of an
instrument you are using, you can of course record the % of items in a test or inventory
that all judges agreed represented the targeted variable. Or, for any single item or the
instrument as a whole, one can record the % of judges who agreed it was valid. However,
more usually one uses the information at once to improve the instrument before using it
for real (and reports on this in one's write-up). Typically any item or instrument as a
whole that at least one judge objected to would be revised. In other words, this often
involves a form of 'item analysis'.
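A minimal sketch of how such judge-agreement percentages might be tallied, assuming an invented record of three judges' accept/reject verdicts on four items (names and figures are hypothetical):

# Invented verdicts: 1 = judge accepts the item as measuring the targeted
# variable, 0 = judge objects. Three judges, four items.
judgements = {
    "item1": [1, 1, 1],
    "item2": [1, 0, 1],
    "item3": [1, 1, 1],
    "item4": [0, 1, 1],
}

for item, votes in judgements.items():
    pct = 100 * sum(votes) / len(votes)
    print(f"{item}: {pct:.0f}% of judges accept -> {'keep' if pct == 100 else 'revise'}")

# % of items that every judge accepted
unanimous = [item for item, votes in judgements.items() if all(votes)]
print(f"{100 * len(unanimous) / len(judgements):.0f}% of items accepted by all judges")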
How does this differ from the sort of item analysis that often goes hand in hand
with the assessment of internal reliability of an instrument?
2. Concurrent validity
Checking up on concurrent validity means checking one's instrument by comparing it
with another one that you know is (more) valid (the 'criterion' measure or instrument). An
obvious limitation is that you cannot use this validation method on any instrument where
no known 'better' instrument exists to be compared with! In much language research,
especially perhaps psycholinguistics, one is measuring things that have not been
measured before. And if a better instrument does exist, in research work the suggestion
will always be there: why not use the known 'better' instrument in the first place, instead
of this other one? Maybe the reason will be cost - compare (3) and (7) in the Introduction.
Furthermore, there may be disagreement as to which is the instrument one can safely
assume is more valid. It is often assumed that 'direct' measures (more naturalistic
methods of measurement) are more valid than 'indirect' ones (e.g. artificial tests). More
elaborate extensions of this concept of validity form the multimethod part of
The columns are understood as labelled the same as the rows of the same number, so the
first figure in the top line (.241) is the correlation between the pertinence ratings of the
texts and the clarity ratings. This is in fact a correlation internally between components of
the instrument being validated, not directly throwing light on the concurrent validity of
the instrument as a whole. In such tables the diagonal is usually left empty because the
correlation of a measure with itself is +1, though sometimes this space is used to display
the standard deviation of scores on each variable. And normally only one triangle is
displayed, not the full square of correlations, because the other triangle would merely
repeat the same information over again (SPSS however gives you everything twice).
                                          2        3        4        5
1 Pertinence rating                     .241    -.019     .408     .459
2 Clarity rating                                  .733     .933     .858
3 Structural accuracy count                                .805     .723
4 Overall communicative instrument
  (1-3 combined)                                                    .953
5 Native speaker criterion rating
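For anyone wanting to produce this kind of display outside SPSS, the sketch below computes a correlation matrix and prints only one triangle; the variable names echo Fischer's components but the scores are randomly generated stand-ins.

import numpy as np
import pandas as pd

# Invented scores for 18 cases; only the layout of the output matters here.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(1, 7, size=(18, 3)),
                    columns=["pertinence", "clarity", "accuracy"])
data["overall"] = data.sum(axis=1)        # composite of the three components

corr = data.corr(method="spearman")       # Spearman, as in the worked example
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # keep only one triangle
print(corr.where(mask).round(3))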
not a teacher). Again the natural way of examining and quantifying the agreement is by a
method already seen for reliability - a contingency table display and the measure of
agreement suitable to data with absolute value - % agreement (or better Cohen's kappa
coefficient):
                           Teacher judgment
                           Masters       Nonmasters
CR Test     Masters          13
            Nonmasters
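A sketch of how % agreement and Cohen's kappa might be computed from such a 2x2 table; the cell counts used below are invented for illustration, not Black and Dockrell's figures.

# % agreement and Cohen's kappa for a 2x2 masters/non-masters table.
#                 teacher: master   teacher: non-master
table = [[40, 6],     # test: master
         [4, 30]]     # test: non-master

n = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / n                     # raw % agreement

row_totals = [sum(row) for row in table]
col_totals = [table[0][j] + table[1][j] for j in range(2)]
p_expected = sum(row_totals[i] * col_totals[i] for i in range(2)) / n ** 2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"% agreement = {p_observed:.1%}, kappa = {kappa:.2f}")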
3. Predictive validity

This one was not illustrated in the Introduction, as it does not really apply to weighing
machines, and indeed not to many instruments used in language research. In fact one can
only check predictive validity for an instrument that claims to measure a variable that by
its nature has some forward-looking element to it, or can be assumed to
predict later performance on some other variable. Tests of language learning aptitude,
reading readiness and the like are the most obvious examples. They can then be checked
by seeing if people who do well on them also do well at a later time on whatever the
instrument claimed to predict (i.e. actually learning a language, reading etc.). The latter
(outcome) measure serves as a criterion to assess the former (predicting) measure, just as
the known more valid measure serves as criterion to assess the questionable one in
concurrent validation.
3.1 Predictive validity: variables on interval scales
Predictive validity is often quantified by a correlation measure of some sort, as the
commonly used instruments that it applies to are tests yielding interval scores. Thus
Curtin et al. (1983) report testing out Pimsleur's foreign language learning aptitude test at
University of Illinois High School. Correlations between the aptitude test scores and
grades obtained later at the end of the first year of learning a foreign language were not
very impressive:
French     .35
German     .305
Russian    .174
Latin      .449
Though nowhere near +1, at least the correlations are all positive - there were no
languages for which higher scores on the aptitude tests foretold lower learning
achievement. Those who did relatively well on the aptitude test were most likely to be
later relatively successful at learning Latin - by the means of instruction used in the above
institution, of course. As usual these correlations tell us nothing about any absolute levels
of achievement in the languages concerned. Further information might be gleaned from
an examination of the four scatterplots for the aptitude and specific language achievement
scores. For instance it might pay to see which cases were "spoiling" the lower
correlations and whether they had anything else in common.
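To make the computation concrete, here is a sketch of obtaining one such predictive-validity correlation and locating cases that 'spoil' it, using invented aptitude scores and end-of-year grades (not Curtin et al.'s data).

import numpy as np
from scipy.stats import pearsonr

# Invented data: aptitude scores at entry and grades at the end of the first year.
aptitude = np.array([23, 35, 41, 28, 50, 33, 45, 30, 38, 26])
grades   = np.array([55, 60, 72, 58, 66, 74, 80, 52, 65, 49])

r, p = pearsonr(aptitude, grades)
print(f"r = {r:.3f}, p = {p:.3f}")

# A crude look for 'spoiling' cases: large residuals from the best-fitting line.
slope, intercept = np.polyfit(aptitude, grades, 1)
residuals = grades - (slope * aptitude + intercept)
print("largest residual at case index", int(np.argmax(np.abs(residuals))))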
3.2 Predictive validity: variables on category scales
A criterion-referenced 'mastery' test which purports to predict, from attainment of a preset
accept or reject level on the test, success or failure on an ensuing course, would best be
analysed like the Black and Dockrell example discussed above, together with a 'false
positive analysis' etc. In practice the problem may be that you will get followup
information only on those who 'passed' the predictive measure, as only those were
allowed to go on to learn, so you will not be able to complete the contingency table. An
example where all the figures are available is afforded by Moller (1975), who examined
the predictive validity of the Davies English test, administered before embarkation to
foreign students coming to the UK, as an indicator of adequacy of their English for
postgraduate courses in a variety of subjects. (The Davies test was popular in the days
before the development of IELTS). The followup criterion here was UK supervisor rating.
                                Supervisor rating
                                Adequate       Inadequate
Pre-embarkation   Adequate        85%            10.5%
test              Inadequate       3.5%            1%
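A sketch of the 'false positive analysis' mentioned above, using invented follow-up counts for a predictive mastery test (the numbers and labels are purely hypothetical, not Moller's):

# Invented follow-up counts: rows = prediction from the test, columns = later outcome.
counts = {("pass predicted", "succeeded"): 40,
          ("pass predicted", "failed"):     6,   # false positives
          ("fail predicted", "succeeded"):  4,   # false negatives
          ("fail predicted", "failed"):    10}

n = sum(counts.values())
agreement = (counts[("pass predicted", "succeeded")] +
             counts[("fail predicted", "failed")]) / n
false_pos = counts[("pass predicted", "failed")] / n
false_neg = counts[("fail predicted", "succeeded")] / n
print(f"agreement {agreement:.0%}, false positives {false_pos:.0%}, false negatives {false_neg:.0%}")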
4. Construct validity

4.1 Construct validity within an instrument

We can illuminate the examination of construct validity within a complex instrument a bit
more by examining the Fischer example study again (see 2.1). Many instruments used in
language research are not complex in this way however. Columns 2-4 of the matrix in the
above section show the internal analysis in correlation terms of Fischer's instrument.
Examining these is effectively one way of checking 'construct validity' since the three
component measures were chosen, with some theoretical backing, as supposedly
quantifying different complementary variables which make up a meaningful single
construct. We therefore expect them not to agree much with each other but to agree with
the overall measure derived from them. Column 4 shows the correlations of each
component variable with Fischer's overall measure. Unlike all the other correlations,
which are "free" to arise or not depending on empirical fact, there is bound to be some
degree of correlation here because of the a priori connection between any overall measure
and its component variables. To confirm validity we expect pretty high correlations,
which we get for two out of the three components in fact: we can see again that the
pertinence rating correlates less well with the overall measure than the other two
component variables do, and the clarity rating the best. Thus the clarity rating would be
the best choice as a single indicator for the instrument as a whole, if you so wished.
Generally you would hope for reasonably high correlations between component variables
and a combined measure made from them, otherwise it might seem that the theoretical
framework that led you to think of them as components of "one" variable was wrong.
Columns 2 and 3 enable you to examine the correlations just between the three grassroots variables themselves. In a composite instrument such as this you would not expect
high correlations here otherwise it might be argued that there is no real point in
combining three separate measures - one would do. In fact two of the three are quite low,
one being very slightly negative. That means those who do well on the pertinence rating
score if anything relatively low on the structural accuracy measure. Indeed correlation-type coefficients can extend all the way down to -1, if relatively high values on one
variable are scored by cases who score low ones on the other and vice versa. This on the
whole supports the logic on which the measure was established, based on the theory that
the three variables are more or less independent contributors to the one composite
construct. If more contributory variables had been measured, the statistical technique of
'factor analysis' could have been used to sort out which were the essential ones,
quantifying definitely distinct components of the composite measure.
4.2 Construct validity of an instrument in relation to other, different, variables
This type of construct validation is really the most fundamental and most widely
available type of all, though often overlooked by researchers who limit themselves to
pursuing the 'easier' content validation. It can in the end involve any of the standard
research designs. A simple example of its use is where, as part of a study of the
acquisition of INFL in English, say, a researcher includes learners of several proficiency
levels, and some adult native speakers of English. All are tested with a test designed to
show mastery of this aspect of English grammar. The hypotheses naturally are that higher
proficiency levels will show greater mastery and NS the highest of all. However, from the
point of view of substantive research these hypotheses are of the type sometimes called
'straw men'. It seems somehow 'obvious' that native speakers will do well and that
learners will do less well, and that learners of higher proficiency will do better than those
with lower proficiency. Such hypotheses really cover things we know already and it
hardly seems worth doing a study just to show them to be true. Good hypotheses to
follow up in a research project are not usually ones that have already been proved
several times over, but ones more on the edge of knowledge. So one would hope
this researcher has some other hypotheses or research questions as well - ones which
assume a bit more and focus on something less well known (like perhaps ones about
which specific manifestations of INFL are acquired first, or whether they all emerge
together). But if a research project does include some 'straw man' hypotheses, this does
have a value as a form of construct validation. Precisely because the relationships are
already known, the new instrument to measure mastery of INFL can be checked for
validity by seeing if it yields the assumed relationships. This employs the 'argumentation
in reverse' that is typical of construct validation. In other words, when the data is
gathered, do we find an increase in mastery at higher proficiency levels, with near perfect
scores for native speakers? If we do, that suggests our INFL test is valid. If not, we would
wonder why and check the test (though if this was all done in the main study, not a pilot,
it is too late to change it!).
What statistics would probably be involved in that example?
To further characterise what goes on in construct validation I shall sketch some other
approaches that could have been used in the Fischer example above, though they were
not. Typically they would have concentrated on external relationships of Fischer's overall
instrument with measures of what are assumed to be clearly separate variables, neither
components of Fischer's nor criterion measures of much the "same" construct. The
criterion measure - the native speaker rating - could also be pursued in the same way. For
either instrument this could be done again either via correlations or, similarly, via research
involving comparisons of groups or conditions.
4.2.1 Construct validity: correlational design
The correlational approach would again generate a matrix of correlation coefficients, this
time between Fischer's instrument and several other different variables. Here you would
look to see if the correlations obtained accorded with what theory would predict the
relationship to be. For instance written communicative competence would not be
predicted to correlate particularly with explicit metalinguistic knowledge of linguistic
terminology, or with intelligence, so you would expect near zero coefficients. This is
sometimes called 'discriminant validation'. On the other hand you would assume a
positive relationship with oral communicative competence. Checking on this sort of
relationship is sometimes called 'convergent validation'.
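A sketch of what such a convergent/discriminant check might look like, with entirely invented scores for the instrument being validated, one variable expected to converge with it and one expected to be unrelated:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# Invented scores for 18 cases.
written_cc = rng.normal(size=18)                          # instrument being validated
oral_cc = written_cc + rng.normal(scale=0.7, size=18)     # should converge (related construct)
metalinguistic = rng.normal(size=18)                      # should diverge (unrelated construct)

for name, other in [("oral communicative competence", oral_cc),
                    ("metalinguistic knowledge", metalinguistic)]:
    rho, p = spearmanr(written_cc, other)
    print(f"{name}: rho = {rho:.2f} (p = {p:.3f})")
# Construct validity is supported if the first coefficient is clearly positive
# and the second is near zero.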
In a different realm, Fasold (1984 p120f) reports the convergent approach being used to
validate census reports on native language in Quebec. You would a priori feel able to
assume that the proportions of adults reporting themselves as mother tongue French or
English speakers would match closely the proportions of children independently recorded as
enrolling in Catholic and Protestant schools, and indeed this turns out to be so. So the
trustworthiness of the census is supported. The divergent approach is an essential
component in the MTMM method of validation (my book 21.2.3).
More elaborate studies along these lines use factor analysis, which can be seen as
analysing a whole set of correlations together rather than pairwise and identifying
which groups of variables are clubbing together to place cases in a similar order with
similar spacing of scores - i.e. which mutually correlate highly. In particular, 'confirmatory
factor analysis' is used where you have a prediction to test about what the correlations
should be, as you will have in a proper construct validation exercise.
4.2.2 Construct validity: non-experimental independent groups design
The group comparative approach was illustrated by the INFL example in 4.2. Fischer's
instrument could similarly be checked by trying it with native speakers as well as
learners. This would be permitted since it was not specifically norm-referenced to
learners only. Something similar was done for the TOEFL (Angoff and Sharon 1971). On
instruments such as these, if it emerged that a group of learners did better than a group of
native speakers, clearly this would be counter to assumptions and you would question the
validity of the measure. Here the EV - the categorisation of cases as 'native speaker' or
'learner' - is the "other" variable that the measure in question (used as DV) is checked for
its relationship with.
Additionally, for instruments made up of many items, Rasch analysis (see reliability) can
be used to show whether items are answered correctly more often at successively higher
proficiency levels or ages, as they should be.
4.2.3 Construct validity: experimental designs (independent groups and repeated
measures)
Instead of using existing characteristics of cases, such as being a native speaker or not,
often it is suitable to 'make' an EV in a fully experimental investigation of construct
validity. So you could use Fischer's instrument on two or more groups of cases after they
are exposed to different conditions or treatments which you feel confident on a priori
grounds will affect their written communicative competence. E.g. you teach one group
communicative writing but not the other. If Fischer's instrument duly records the assumed
difference, then that supports its validity. The term 'treatment validity' is sometimes used
for this specific approach.
The same 'treatment' approach is widely used in validating criterion-referenced
achievement tests in pedagogical contexts (Black and Dockrell 1984 p92). You
administer a questionably valid test of, say, apologising appropriately in English, to a
group who have been taught the content on which the test is supposedly based, and a
group who have not. The natural assumption here is of course that the former group
should score higher than the latter, if the test is valid. A variant of this approach would
rather involve administering the test twice to the same group, before and after instruction,
to see if the assumed improvement registers or not. Not dissimilar, also, is the following
procedure, often done quite informally. In a multi-item test or inventory you intersperse
'control' items on which you have a priori assumptions of what the responses should be.
For example, if cases are rating the 'idiomaticity' of some phrases you may include some
that are clearly non-idiomatic as a check on whether the cases are using the same
definition as you.
4.2.4 Conclusion on construct validation statistics
At this point it is worth drawing attention to some differences between the use of
correlation coefficients to quantify validity and their use in connection with reliability.
First, we have just seen that near zero and negative values may well arise in validity
work: these would be most unexpected in a reliability study (or indeed concurrent or
predictive validity study). Second, correlation matrices are most often seen in a validity
study. In reliability work where you have, say, a set of three or more repetitions of a test,
you would probably try to quantify the overall agreement between them rather than
examine the agreement between each possible pair of occasions when the test was
redone. Finally, in the study of validity you are studying the relationships between
different variables and so often comparing instruments yielding scores on different scales
of the same general type. For instance, Fischer's instrument has a maximum score of 12,
the criterion rating one of 24. That again does not arise in reliability checking as by
definition the same measure on the same scale is repeated in some way. This causes no
problem for correlation-type coefficients, incidentally, since they only quantify agreement
in a relative sense and are largely unaffected by differences in score level on the two
variables compared.
Further, something I have generally glossed over above but which must be considered when
interpreting the correlation coefficients is the number of cases involved. To oversimplify,
what constitutes a correlation that is big enough in a positive or negative direction to be
taken as marking any convincing relationship (versus no marked relationship at all)
depends on the number of cases. The threshold values can be found on significance tables
in basic statistics texts. The threshold value for Spearman rho for 18 cases is in fact .399,
so we see that two of the correlations obtained for Fischer's instrument are not high
enough to indicate any real relationship at all (using the customary .05 level of
significance).
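In practice the p value is usually obtained directly from software rather than from printed tables; a sketch with invented scores for 18 cases:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
x = rng.normal(size=18)               # invented scores for 18 cases
y = 0.3 * x + rng.normal(size=18)     # invented scores on a second variable

rho, p = spearmanr(x, y)
# With n = 18 the two-tailed .05 threshold for rho is about .40, so a smaller
# coefficient (equivalently, p > .05) would not count as a convincing relationship.
print(f"rho = {rho:.3f}, p = {p:.3f}, significant at .05: {p < 0.05}")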
In a similar way, the key statistical information that emerges from the group comparison
and experimental methods is an indication of probability, signified by 'p', and perhaps
other measures of amount of relationship like eta squared (rather than a correlation
coefficient). These can be arrived at by any number of statistical means appropriate to the
particular design used. As customary, if p is less than 0.05, researchers would generally
assume that a difference in DV scores worth paying attention to has been demonstrated
between the groups or conditions. Technically the difference is said to be 'significant'. See
further a text such as Langley (1979) or Rowntree (1981) for a simple account of the
reasoning behind significance tests and p values and their interpretation. In the present
application, then, a p near zero would indicate a relationship between the EV variable and
the one whose quantification is in question. You would then have to look further at the scores
to see if the relationship was in the assumed direction - e.g. that native speakers did do
significantly better rather than worse than learners on Fischer's measure, and so support
its validity.
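As an illustration of the group-comparison route, here is a sketch comparing invented scores for learners and native speakers on an instrument being validated; an independent-groups t test is used, though it is only one of several defensible choices.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)

# Invented DV scores on the instrument being validated.
learners = rng.normal(loc=7.0, scale=2.0, size=20)
native_speakers = rng.normal(loc=10.5, scale=1.0, size=15)

t, p = ttest_ind(native_speakers, learners, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")

# The p value alone does not give the direction: check the means as well.
if p < 0.05 and native_speakers.mean() > learners.mean():
    print("Native speakers score significantly higher, as assumed -> supports validity")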
All these approaches involving validation against measures of clearly different variables
share the characteristic of being like normal research designs, but done "back to front".
Indeed any of the many designs of investigations that are used in research can be used in
this back-to-front way as validity checks. To reiterate: what I mean by "back to front" is
as follows. Normally in language research you assume that the variables that are of
interest can be quantified validly and seek through correlation study or experiment etc. to
find out something about relationships between them. In the present situation you rather
have to assume you know the relationship between the variables, and that measures of
variables other than the one whose validity is in question are valid, in order to find out
whether the instrument in question is measuring what it is supposed to.
If you do use in this way several methods of measuring the 'same thing' (or at least of
what you intend to be the same construct) of course it looks a bit like concurrent
validation, but is actually quite different. You might for example measure reading ability
both with a cloze test (filling gaps in a text) and a multiple choice test of comprehension
after reading a passage, neither of which you are sure is valid. You can
even compare the results (e.g. with correlation) to see how far they agree, in addition to
combining the scores to get 'the best measure'. But this is not concurrent validation unless
you are able to make the assumption that one of the instruments is definitely valid. If the
two measures agree, it is encouraging but does not actually confirm their validity if you
know the validity of neither in any other way. There could be a constant error shared by
both, just as in reliability analysis. And if there is a big difference in the results, you don't
know if it is because one is valid and the other not, or because both are invalid in
different ways, or what. Gathering qualitative data in the form of comments from the
subjects afterwards may help decide. But there is always the risk that you create a set of
scores with less validity by combining the results of a highly valid instrument with those
of an invalid one.
Probably the best way to decide which multiple ways of measuring the same thing to
use is to choose ones which seem likely to involve rather different sources of
invalidity, in the hope that these balance out. For instance, to get a balanced overall measure of
vocabulary knowledge you would combine tests with written, oral and picture
stimuli, so that it is not invalidated by being partly a reading test, which could be the case
if you used only written stimuli. You need an awareness of sources of invalidity (see my
book esp. ch 21-23 and accounts of validity generally).
Investigation of validity (and indeed of reliability) can be a research topic in itself rather
than, as we have pursued it here, just something you do on the side of your 'real' research,
to check that the instruments are accurate. A particular kind of validity checking done as a
research project in itself is the 'multitrait multimethod' study (MTMM). This sets out to
systematically measure more than one construct ('trait') each by more than one type of
instrument ('method') in a parallel way. One then gets to see whether the different
instruments ('methods') produce more drastically different results than the different
supposed constructs ('traits') do (Bachman; for a clear example of its use see Gardner,
Lalonde et al., The role of attitude and motivation, Language Learning 35).
When combining the results of different instruments, you need a bit of statistical
know-how. We consider here interval scores with relative, not absolute, value. Just adding
the scores from three different reading tests will not be satisfactory if they are on different
scales, as the weight or contribution of each instrument will not be equal. Even if they are
all on the same interval scale, the three sets of figures may have different standard
deviations, which has an effect. If they are all at least interval, one method is to turn all
the scores into standard scores (also known as z scores), which puts them all on scales of
the same length with the same mean and SD. In SPSS this can be done by going to
Analyze...Descriptive statistics...Descriptives, and marking the box for SPSS to Save
standardized values as variables. SPSS creates columns with the standard score
equivalents of the columns you had before. In fact the standard score (or z score) equivalent
of a score = (the original score - the mean) / the SD. This reduces any interval
scale to a scale with scores of mean 0 and SD 1, ranging approximately between -3 and
+3. You can then safely add or average the three sets of scores to produce a composite
scale. For non-interval scales or mixtures of scale types other techniques are needed (On
combining scores see further my book chs 15-18).
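Outside SPSS the same standardisation can be done directly from the formula just given; a sketch with invented scores on three reading tests of different lengths:

import numpy as np

# Invented raw scores for the same six cases on three reading tests with different scales.
test_a = np.array([12, 15, 9, 18, 11, 14])     # out of 20
test_b = np.array([55, 70, 40, 85, 50, 65])    # out of 100
test_c = np.array([3, 4, 2, 5, 3, 4])          # out of 5

def z_scores(x):
    # standard score = (original score - mean) / SD, as in the formula above
    return (x - x.mean()) / x.std(ddof=1)

composite = (z_scores(test_a) + z_scores(test_b) + z_scores(test_c)) / 3
print(np.round(composite, 2))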
A more sophisticated way of combining interval scores would be to use factor analysis to
pick out the most marked information shared between the scores for the three instruments
and use scores for each subject on that 'first common factor' as the overall measure (see
Factor Analysis). Obviously there is a good chance that the information shared by three
partly invalid instruments designed to measure the same thing will be more valid than the
information that differentiates them.
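A sketch of how the first-common-factor approach might be carried out, here using scikit-learn's FactorAnalysis purely as one convenient routine; the three sets of scores are invented.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)

# Invented scores: three partly overlapping measures of the 'same' ability for 30 cases.
ability = rng.normal(size=30)
scores = np.column_stack([ability + rng.normal(scale=0.5, size=30),
                          ability + rng.normal(scale=0.7, size=30),
                          ability + rng.normal(scale=0.9, size=30)])

fa = FactorAnalysis(n_components=1)
first_factor = fa.fit_transform(scores)[:, 0]   # each case's score on the first common factor
print(np.round(first_factor[:5], 2))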