
A Paradigm for Developing Better Measures of Marketing Constructs

Author(s): Gilbert A. Churchill, Jr.

Source: Journal of Marketing Research, Vol. 16, No. 1 (Feb., 1979), pp. 64-73
Published by: American Marketing Association
Measure and Construct Validity Studies


A critical element in the evolution of a fundamental body of knowledge

in marketing, as well as for improved marketing practice, is the development
of better measures of the variables with which marketers work. In this article
an approach is outlined by which this goal can be achieved and portions
of the approach are illustrated in terms of a job satisfaction measure.

A Paradigm for Developing Better Measures of Marketing Constructs

In an article in the April 1978 issue of the Journal of Marketing, Jacoby placed much of the blame for the poor quality of some of the marketing literature on the measures marketers use to assess their variables of interest (p. 91):

More stupefying than the sheer number of our measures is the ease with which they are proposed and the uncritical manner in which they are accepted. In point of fact, most of our measures are only measures because someone says that they are, not because they have been shown to satisfy standard measurement criteria (validity, reliability, and sensitivity). Stated somewhat differently, most of our measures are no more sophisticated than first asserting that the number of pebbles a person can count in a ten-minute period is a measure of that person's intelligence; next, conducting a study and finding that people who can count many pebbles in ten minutes also tend to eat more; and, finally, concluding from this: people with high intelligence tend to eat more.

Burleigh Gardner, President of Social Research, Inc., makes a similar point with respect to attitude measurement in a recent issue of the Marketing News (May 5, 1978, p. 1):

Today the social scientists are enamored of numbers and counting . . . Rarely do they stop and ask, "What lies behind the numbers?"
When we talk about attitudes we are talking about constructs of the mind as they are expressed in response to our questions.
But usually all we really know are the questions we ask and the answers we get.

Marketers, indeed, seem to be choking on their measures, as other articles in this issue attest. They seem to spend much effort and time operating by the routine which computer technicians refer to as GIGO: garbage in, garbage out. As Jacoby so succinctly puts it, "What does it mean if a finding is significant or that the ultimate in statistical analytical techniques have been applied, if the data collection instrument generated invalid data at the outset?" (1978, p. 90).

*Gilbert A. Churchill is Professor of Marketing, University of Wisconsin-Madison. The significant contributions of Michael Houston, Shelby Hunt, John Nevin, and Michael Rothschild through their comments on a draft of this article are gratefully acknowledged, as are the many helpful comments of the anonymous reviewers. The AMA publications policy states: "No article(s) will be published in the Journal of Marketing Research written by the Editor or the Vice President of Publications." The inclusion of this article was approved by the Board of Directors because: (1) the article was submitted before the author took over as Editor, (2) the author played no part in its review, and (3) Michael Ray, who supervised the reviewing process for the special issue, formally requested he be allowed to publish it.

Journal of Marketing Research, Vol. XVI (February 1979), 64-73

What accounts for this gap between the obvious need for better measures and the lack of such measures? The basic thesis of this article is that although the desire may be there, the know-how is not. The situation in marketing seems to parallel the dilemma which psychologists faced more than 20 years ago, when Tryon (1957, p. 229) wrote:

If an investigator should invent a new psychological test and then turn to any recent scholarly work for
guidance on how to determine its reliability, he would confront such an array of different formulations that he would be unsure about how to proceed. After fifty years of psychological testing, the problem of discovering the degree to which an objective measure of behavior reliably differentiates individuals is still confused.

Psychology has made progress since that time. Attention has moved beyond simple questions of reliability and now includes more "direct" assessments of validity. Unfortunately, the marketing literature has been slow to reflect that progress. One of the main reasons is that the psychological literature is scattered. The notions are available in many bits and pieces in a variety of sources. There is no overriding framework which the marketer can embrace to help organize the many definitions and measures of reliability and validity into an integrated whole so that the decision as to which to use and when is obvious.

This article is an attempt to provide such a framework. A procedure is suggested by which measures of constructs of interest to marketers can be developed. The emphasis is on developing measures which have desirable reliability and validity properties. Part of the article is devoted to clarifying these notions, particularly those related to validity; reliability notions are well covered by Peter's article in this issue. Finally, the article contains suggestions about approaches on which marketers historically have relied in assessing the quality of measures, but which they would do well to consider abandoning in favor of some newer alternatives. The rationale as to why the newer alternatives are preferred is presented.

Technically, the process of measurement or operationalization involves "rules for assigning numbers to objects to represent quantities of attributes" (Nunnally, 1967, p. 2). The definition involves two key notions. First, it is the attributes of objects that are measured and not the objects themselves. Second, the definition does not specify the rules by which the numbers are assigned. However, the rigor with which the rules are specified and the skill with which they are applied determine whether the construct has been captured by the measure.

Consider some arbitrary construct, C, such as customer satisfaction. One can conceive at any given point in time that every customer has a "true" level of satisfaction; call this level X_T. Hopefully, each measurement one makes will produce an observed score, X_O, equal to the object's true score, X_T. Further, if there are differences between objects with respect to their X_O scores, these differences would be completely attributable to true differences in the characteristic one is attempting to measure, i.e., true differences in X_T. Rarely is the researcher so fortunate. Much more typical is the measurement where the X_O score differences also reflect (Selltiz et al., 1976, p. 164-8):

1. True differences in other relatively stable characteristics which affect the score, e.g., a person's willingness to express his or her true feelings.
2. Differences due to transient personal factors, e.g., a person's mood, state of fatigue.
3. Differences due to situational factors, e.g., whether the interview is conducted in the home or at a central facility.
4. Differences due to variations in administration, e.g., interviewers who probe differently.
5. Differences due to sampling of items, e.g., the specific items used on the questionnaire; if the items or the wording of those items were changed, the X_O scores would also change.
6. Differences due to lack of clarity of measuring instruments, e.g., vague or ambiguous questions which are interpreted differently by those responding.
7. Differences due to mechanical factors, e.g., a check mark in the wrong box or a response which is coded in the wrong category.

Not all of these factors will be present in every measurement, nor are they limited to information collected by questionnaire in personal or telephone interviews. They arise also in studies in which self-administered questionnaires or observational techniques are used. Although the impact of each factor on the X_O score varies with the approach, their impact is predictable. They distort the observed scores away from the true scores. Functionally, the relationship can be expressed as:

X_O = X_T + X_S + X_R

where

X_S = systematic sources of error such as stable characteristics of the object which affect its score, and
X_R = random sources of error such as transient personal factors which affect the object's score.

A measure is valid when the differences in observed scores reflect true differences on the characteristic one is attempting to measure and nothing else, that is, X_O = X_T. A measure is reliable to the extent that independent but comparable measures of the same trait or construct of a given object agree. Reliability depends on how much of the variation in scores is attributable to random or chance errors. If a measure is perfectly reliable, X_R = 0. Note that if a measure is valid, it is reliable, but that the converse is not necessarily true because the observed score when X_R = 0 could still equal X_T + X_S. Thus it is often said that reliability is a necessary but not a sufficient condition for validity. Reliability only provides negative evidence of the validity of the measure. However, the ease with which it can be computed helps explain its popularity.
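The decomposition above can be illustrated with a small simulation; this is a sketch with assumed, illustrative error variances, none of which come from the article. Two administrations of the same instrument agree closely (the measure looks reliable), yet a stable systematic error X_S keeps the observed scores from tracking the true scores (the measure is less valid):

```python
import random
import statistics

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

random.seed(1)
n = 5000

# X_T: true scores on the construct
x_t = [random.gauss(50, 10) for _ in range(n)]
# X_S: stable systematic error, e.g., an unwillingness to express
# true feelings (constant across administrations)
x_s = [random.gauss(0, 8) for _ in range(n)]

def administer():
    # each administration adds fresh random error X_R
    return [t + s + random.gauss(0, 3) for t, s in zip(x_t, x_s)]

first, second = administer(), administer()

print(round(corr(first, second), 2))  # high: the measure is reliable
print(round(corr(first, x_t), 2))     # noticeably lower: it is less valid
```

The gap between the two correlations is exactly the point made above: because X_S survives repeated administrations, reliability is necessary but not sufficient for validity.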

Reliability is much more routinely reported than is evidence, which is much more difficult to secure but which relates more directly to the validity of the measure.

The fundamental objective in measurement is to produce X_O scores which approximate X_T scores as closely as possible. Unfortunately, the researcher never knows for sure what the X_T scores are. Rather, the measures are always inferences. The quality of these inferences depends directly on the procedures that are used to develop the measures and the evidence supporting their "goodness." This evidence typically takes the form of some reliability or validity index, of which there are a great many, perhaps too many. The analyst working to develop a measure must contend with such notions as split-half, test-retest, and alternate forms reliability as well as with face, content, predictive, concurrent, pragmatic, construct, convergent, and discriminant validity. Because some of these terms are used interchangeably and others are often used loosely, the analyst wishing to develop a measure of some variable of interest in marketing faces difficult decisions about how to proceed and what reliability and validity indices to calculate.

Figure 1 is a diagram of the sequence of steps that can be followed and a list of some calculations that should be performed in developing measures of marketing constructs. The suggested sequence has worked well in several instances in producing measures with desirable psychometric properties (see Churchill et al., 1974, for one example). Some readers will undoubtedly disagree with the suggested process or with the omission of their favorite reliability or validity coefficient. The following discussion, which details both the steps and their rationale, shows that some of these measures should indeed be set aside because there are better alternatives or, if they are used, that they should at least be interpreted with the proper awareness of their shortcomings.

Figure 1
SUGGESTED PROCEDURE FOR DEVELOPING BETTER MEASURES
[Flowchart, partially recoverable from the scan; each step is paired with recommended coefficients or techniques: 1. Specify domain of construct (literature search); 2. Generate sample of items (literature search, experience survey, insight-stimulating examples, critical incidents, focus groups); 3. Collect data; 4. Purify measure (coefficient alpha, factor analysis); 5. Collect data; 6. Assess reliability (coefficient alpha, split-half reliability); 7. Assess validity (criterion validity); 8. Develop norms (average and other statistics summarizing distribution of scores).]

The process suggested is only applicable to multi-item measures. This deficiency is not as serious as it might appear. Multi-item measures have much to recommend them. First, individual items usually have considerable uniqueness or specificity in that each item tends to have only a low correlation with the attribute being measured and tends to relate to other attributes as well. Second, single items tend to categorize people into a relatively small number of groups. For example, a seven-step rating scale can at most distinguish between seven levels of an attribute. Third, individual items typically have considerable measurement error; they produce unreliable responses in the sense that the same scale position is unlikely to be checked in successive administrations of an instrument.

All three of these measurement difficulties can be diminished with multi-item measures: (1) the specificity of items can be averaged out when they are combined, (2) by combining items, one can make relatively fine distinctions among people, and (3) the reliability tends to increase and measurement error decreases as the number of items in a combination increases.

The folly of using single-item measures is illustrated by a question posed by Jacoby (1978, p. 93):

How comfortable would we feel having our intelligence assessed on the basis of our response to a single question? Yet that's exactly what we do in consumer research. . . . The literature reveals hundreds of instances in which responses to a single question suffice to establish the person's level on the variable of interest and then serves as the basis for extensive analysis and entire articles.
. . . Given the complexity of our subject matter, what makes us think we can use responses to single items (or even to two or three items) as measures of these concepts, then relate these scores to a host of other variables, arrive at conclusions based on such an investigation, and get away calling what we have done "quality research?"

In sum, marketers are much better served with multi-item than single-item measures of their constructs, and they should take the time to develop them. This conclusion is particularly true for those investigating behavioral relationships from a fundamental as well as applied perspective, although it applies also to marketing practitioners.
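The third advantage noted above, that reliability grows as items are added, can be sketched with the Spearman-Brown formula, a standard psychometric result rather than one derived in this article; the average interitem correlation of .30 is an assumed illustrative value:

```python
def composite_reliability(k: int, avg_r: float) -> float:
    """Spearman-Brown: reliability of a k-item composite whose
    items have average intercorrelation avg_r."""
    return k * avg_r / (1 + (k - 1) * avg_r)

# with an assumed average interitem correlation of .30, the
# reliability of the summed scale climbs steadily with k
for k in (1, 3, 5, 10, 20):
    print(k, round(composite_reliability(k, 0.30), 2))
```

A single item inherits only the .30 by itself, while twenty such items yield a composite reliability near .90, which is the sense in which measurement error is averaged out.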

SPECIFY DOMAIN OF THE CONSTRUCT

The first step in the suggested procedure for developing better measures involves specifying the domain of the construct. The researcher must be exacting in delineating what is included in the definition and what is excluded. Consider, for example, the construct customer satisfaction, which lies at the heart of the marketing concept. Though it is a central notion in modern marketing thought, it is also a construct which marketers have not measured in exacting fashion. Howard and Sheth (1969, p. 145), for example, define customer satisfaction as

. . . the buyer's cognitive state of being adequately or inadequately rewarded in a buying situation for the sacrifice he has undergone. The adequacy is a consequence of matching actual past purchase and consumption experience with the reward that was expected from the brand in terms of its anticipated potential to satisfy the motives served by the particular product class. It includes not only reward from consumption of the brand but any other reward received in the purchasing and consuming process.

Thus, satisfaction by their definition seems to be attitude. Further, in order to measure satisfaction, it seems necessary to measure both expectations at the time of purchase and reactions at some time after purchase. If actual consequences equal or exceed expected consequences, the consumer is satisfied, but if actual consequences fall short of expected consequences, the consumer is dissatisfied.

But what expectations and consequences should the marketer attempt to assess? Certainly one would want to be reasonably exhaustive in the list of product features to be included, incorporating such facets as cost, durability, quality, operating performance, and aesthetic features (Czepiel et al., 1974). But what about purchasers' reactions to the sales assistance they received or subsequent service by independent dealers, as would be needed, for example, after the purchase of many small appliances? What about customer reaction to subsequent advertising or the expansion of the channels of distribution in which the product is available? What about the subsequent availability of competitors' alternatives which serve the same needs or the publishing of information about the environmental effects of using the product? To detail which of these factors would be included or how customer satisfaction should be operationalized is beyond the scope of this article; rather, the example emphasizes that the researcher must be exacting in the conceptual specification of the construct and what is and what is not included in the domain.

It is imperative, though, that researchers consult the literature when conceptualizing constructs and specifying domains. Perhaps if only a few more had done so, one of the main problems cited by Kollat, Engel, and Blackwell as impairing progress in consumer research, namely the use of widely varying definitions, could have been at least diminished (Kollat et al., 1970, p. 328-9).

Certainly definitions of constructs are means rather than ends in themselves. Yet the use of different definitions makes it difficult to compare and accumulate findings and thereby develop syntheses of what is known. Researchers should have good reasons for proposing additional new measures given the many available for most marketing constructs of interest, and those publishing should be required to supply their rationale. Perhaps the older measures are inadequate. The researcher should make sure this is the case by conducting a thorough review of literature in which the variable is used and should present a detailed statement of the reasons and evidence as to why the new measure is better.

GENERATE SAMPLE OF ITEMS

The second step in the procedure for developing better measures is to generate items which capture the domain as specified. Those techniques that are typically productive in exploratory research, including literature searches, experience surveys, and insight-stimulating examples, are generally productive here (Selltiz et al., 1976). The literature should indicate how the variable has been defined previously and how many dimensions or components it has. The search for ways to measure customer satisfaction would include product brochures, articles in trade magazines and newspapers, or results of product tests such as those published by Consumer Reports. The experience survey is not a probability sample but a judgment sample of persons who can offer some ideas and insights into the phenomenon. In measuring consumer satisfaction, it could include discussions with (1) appropriate people in the product group responsible for the product, (2) sales representatives, (3) dealers, (4) consumers, and (5) persons in marketing research or advertising, as well as (6) outsiders who have a special expertise such as university or government personnel. The insight-stimulating examples could involve a comparison of competitors' products or a detailed examination of some particularly vehement complaints in unsolicited letters about performance of the product. Examples which indicate sharp contrasts or have striking features would be most productive.

Critical incidents and focus groups also can be used to advantage at the item-generation stage. To use the critical incidents technique a large number of scenarios describing specific situations could be made up and a sample of experienced consumers would be asked what specific behaviors (e.g., product changes, warranty handling) would create customer satisfaction or dissatisfaction (Flanagan, 1954; Kerlinger, 1973, p. 536).

The scenarios might be presented to the respondents individually or 8 to 10 of them might be brought together in a focus group where the scenarios could be used to trigger open discussion among participants, although other devices might also be employed to promote discourse (Calder, 1977).

The emphasis at the early stages of item generation would be to develop a set of items which tap each of the dimensions of the construct at issue. Further, the researcher probably would want to include items with slightly different shades of meaning because the original list will be refined to produce the final measure. Experienced researchers can attest that seemingly identical statements produce widely different answers. By incorporating slightly different nuances of meaning in statements in the item pool, the researcher provides a better foundation for the eventual measure.

Near the end of the statement development stage the focus would shift to item editing. Each statement would be reviewed so that its wording would be as precise as possible. Double-barreled statements would be split into two single-idea statements, and if that proved impossible the statement would be eliminated altogether. Some of the statements would be recast to be positively stated and others to be negatively stated to reduce "yea-" or "nay-" saying tendencies. The analyst's attention would also be directed at refining those questions which contain an obvious "socially acceptable" response.

After the item pool is carefully edited, further refinement would await actual data. The type of data collected would depend on the type of scale used to measure the construct.

PURIFY THE MEASURE

The calculations one performs in purifying a measure depend somewhat on the measurement model one embraces. The most logically defensible model is the domain sampling model which holds that the purpose of any particular measurement is to estimate the score that would be obtained if all the items in the domain were used (Nunnally, 1967, p. 175-81). The score that any subject would obtain over the whole sample domain is the person's true score, X_T.

In practice, though, one does not use all of the items that could be used, but only a sample of them. To the extent that the sample of items correlates with true scores, it is good. According to the domain sampling model, then, a primary source of measurement error is the inadequate sampling of the domain of relevant items.

Basic to the domain sampling model is the concept of an infinitely large correlation matrix showing all correlations among the items in the domain. No single item is likely to provide a perfect representation of the concept, just as no single word can be used to test for differences in subjects' spelling abilities and no single question can measure a person's intelligence. Rather, each item can be expected to have a certain amount of distinctiveness or specificity even though it relates to the concept.

The average correlation in this infinitely large matrix, r̄, indicates the extent to which some common core is present in the items. The dispersion of correlations about the average indicates the extent to which items vary in sharing the common core. The key assumption in the domain sampling model is that all items, if they belong to the domain of the concept, have an equal amount of common core. This statement implies that the average correlation in each column of the hypothetical matrix is the same and in turn equals the average correlation in the whole matrix (Ley, 1972, p. 111; Nunnally, 1967, p. 175-6). That is, if all the items in a measure are drawn from the domain of a single construct, responses to those items should be highly intercorrelated. Low interitem correlations, in contrast, indicate that some items are not drawn from the appropriate domain and are producing error and unreliability.

Coefficient Alpha

The recommended measure of the internal consistency of a set of items is provided by coefficient alpha which results directly from the assumptions of the domain sampling model. See Peter's article in this issue for the calculation of coefficient alpha.

Coefficient alpha absolutely should be the first measure one calculates to assess the quality of the instrument. It is pregnant with meaning because the square root of coefficient alpha is the estimated correlation of the k-item test with errorless true scores (Nunnally, 1967, p. 191-6). Thus, a low coefficient alpha indicates the sample of items performs poorly in capturing the construct which motivated the measure. Conversely, a large alpha indicates that the k-item test correlates well with true scores.

If alpha is low, what should the analyst do?¹ If the item pool is sufficiently large, this outcome suggests that some items do not share equally in the common core and should be eliminated. The easiest way to find them is to calculate the correlation of each item with the total score and to plot these correlations by decreasing order of magnitude. Items with correlations near zero would be eliminated. Further, items which produce a substantial or sudden drop in the item-to-total correlations would also be deleted.

¹What is "low" for alpha depends on the purpose of the research. For early stages of basic research, Nunnally (1967) suggests reliabilities of .50 to .60 suffice and that increasing reliabilities beyond .80 is probably wasteful. In many applied settings, however, where important decisions are made with respect to specific test scores, "a reliability of .90 is the minimum that should be tolerated, and a reliability of .95 should be considered the desirable standard" (p. 226).

If the construct had, say, five identifiable dimensions or components, coefficient alpha would be calculated for each dimension. The item-to-total correlations used to delete items would also be based on the items in the component and the total score for that dimension. The total score for the construct would be secured by summing the total scores for the separate components. The reliability of the total construct would not be measured through coefficient alpha, but rather through the formula for the reliability of linear combinations (Nunnally, 1967, p. 226-35).

Some analysts mistakenly calculate split-half reliability to assess the internal homogeneity of the measure. That is, they divide the measure into two halves. The first half may be composed of all the even-numbered items, for example, and the second half all the odd-numbered items. The analyst then calculates a total score for each half and correlates these total scores across subjects. The problem with this approach is that the size of this correlation depends on the way the items are split to form the two halves. With, say, 10 items (a very small number for most measurements), there are 126 possible splits.² Because each of these possible divisions will likely produce a different coefficient, what is the split-half reliability? Further, as the average of all of these coefficients equals coefficient alpha, why not calculate coefficient alpha in the first place? It is almost as easy, is not arbitrary, and has an important practical connotation.

²The number of possible splits with 2n items is given by the formula (Bohrnstedt, 1970):

(2n)! / [2(n!)(n!)]

For the example cited, n = 5 and the formula reduces to 10! / [2(5!)(5!)] = 126.

Factor Analysis

Some analysts like to perform a factor analysis on the data before doing anything else in the hope of determining the number of dimensions underlying the construct. Factor analysis can indeed be used to suggest dimensions, and the marketing literature is replete with articles reporting such use. Much less prevalent is its use to confirm or refute components isolated by other means. For example, in discussing a test composed of items tapping two common factors, verbal fluency and number facility, Campbell (1976, p. 194) comments:

Recognizing multidimensionality when we see it is not always an easy task. For example, rules for when to stop extracting factors are always arbitrary in some sense. Perhaps the wisest course is to always make the comparison between the split half and internal consistency estimates after first splitting the components into two halves on a priori grounds. That is, every effort should be made to balance the factor content of each half [part] before looking at component intercorrelations.

When factor analysis is done before the purification steps suggested heretofore, there seems to be a tendency to produce many more dimensions than can be conceptually identified. This effect is partly due to the "garbage items" which do not have the common core but which do produce additional dimensions in the factor analysis. Though this application may be satisfactory during the early stages of research on a construct, the use of factor analysis in a confirmatory fashion would seem better at later stages. Further, theoretical arguments support the iterative process of the calculation of coefficient alpha, the elimination of items, and the subsequent calculation of alpha until a satisfactory coefficient is achieved. Factor analysis then can be used to confirm whether the number of dimensions conceptualized can be verified empirically.

Iteration

The foregoing procedure can produce several outcomes. The most desirable outcome occurs when the measure produces a satisfactory coefficient alpha (or alphas if there are multiple dimensions) and the dimensions agree with those conceptualized. The measure is then ready for some additional testing for which a new sample of data should be collected. Second, factor analysis sometimes suggests that dimensions which were conceptualized as independent clearly overlap. In this case, the items which have pure loadings on the new factor can be retained and a new alpha calculated. If this outcome is satisfactory, additional testing with new data is indicated.

The third and least desirable outcome occurs when the alpha coefficient(s) is too low and restructuring of the items forming each dimension is unproductive. In this case, the appropriate strategy is to loop back to steps 1 and 2 and repeat the process to ascertain what might have gone wrong. Perhaps the construct was not appropriately delineated. Perhaps the item pool did not sample all aspects of the domain. Perhaps the emphases within the measure were somehow distorted in editing. Perhaps the sample of subjects was biased, or the construct so ambiguous as to defy measurement. The last conclusion would suggest a fundamental change in strategy, starting with a rethinking of the basic relationships that motivated the investigation in the first place.

ASSESS RELIABILITY WITH NEW DATA

The major source of error within a test or measure is the sampling of items. If the sample is appropriate and the items "look right," the measure is said to have face or content validity. Adherence to the steps suggested will tend to produce content valid measures. But that is not the whole story!

What about transient personal factors, or ambiguous questions which produce guessing, or any of the other extraneous influences, other than the sampling of items, which tend to produce error in the measure?

Interestingly, all of the errors that occur within a test can be easily encompassed by the domain sampling model. All the sources of error occurring within a measurement will tend to lower the average correlation among the items within the test, but the average correlation is all that is needed to estimate the reliability. Suppose, for example, that one of the items is vague and respondents have to guess its meaning. This guessing will tend to lower coefficient alpha, suggesting there is error in the measurement. Subsequent calculation of item-to-total correlations will then suggest this item for elimination.

Coefficient alpha is the basic statistic for determining the reliability of a measure based on internal consistency. Coefficient alpha does not adequately estimate, though, errors caused by factors external to the instrument, such as differences in testing situations and respondents over time. If the researcher wants a reliability coefficient which assesses the

consistent or internally homogeneous set of items. Consistency is necessary but not sufficient for construct validity (Nunnally, 1967, p. 92).

Rather, to establish the construct validity of a measure, the analyst also must determine (1) the extent to which the measure correlates with other measures designed to measure the same thing and (2) whether the measure behaves as expected.

Correlations With Other Measures

A fundamental principle in science is that any particular construct or trait should be measurable by at least two, and preferably more, different methods. Otherwise the researcher has no way of knowing whether the trait is anything but an artifact of the measurement procedure. All the measurements of the trait may not be equally good, but science continually emphasizes improvement of the measures of the variables with which it works. Evidence of the convergent validity of the measure is provided by the extent to which it correlates highly with other methods designed to measure the same construct.
between-testerror, additionaldata must be collected. The measures should have not only convergent
It is also advisable to collect additionaldata to rule validity, but also discriminantvalidity. Discriminant
out the possibility that the previous findings are due validity is the extent to which the measure is indeed
to chance. If the constructis morethana measurement novel and not simply a reflection of some other
artifact, it should be reproduced when the purified variable. As Campbelland Fiske (1959) persuasively
sample of items is submitted to a new sample of argue, "Tests can be invalidatedby too high correla-
subjects. tions with other tests from which they were intended
Because Peter's article treats the assessment of to differ" (p. 81). Quite simply, scales that correlate
reliability,it is not examined here except to suggest too highly may be measuringthe same rather than
thattest-retestreliabilityshouldnot be used. The basic differentconstructs. Discriminantvalidity is indicated
problem with straighttest-retest reliabilityis respon- by "predictablylow correlationsbetween the measure
dents' memories. They will tend to reply to an item of interest and other measures that are supposedly
the same way in a second administrationas they did not measuringthe same variableor concept" (Heeler
in the first. Thus, even if an analyst were to put and Ray, p. 362).
together an instrumentin which the items correlate A useful way of assessing the convergent and
poorly, suggestingthere is no common core and thus discriminant validity of a measure is through the
no construct, it is possible and even probable that multitrait-multimethod matrix, which is a matrix of
the responses to each item would correlatewell across zero order correlationsbetween different traits when
the two measurements. The high correlation of the each of the traits is measured by different methods
totalscores on the two tests wouldsuggestthe measure (Campbelland Fiske, 1959). Table 1, for example, is
had small measurementerror when in fact very little the matrix for a Likert type of measure designed to
is demonstratedabout validity by straight test-retest assess salesperson job satisfaction (Churchillet al.,
correlations. 1974). The four essential elements of a multitrait-
multimethodmatrix are identified by the numbersin
ASSESS CONSTRUCTVALIDITY the upper left corner of each partitionedsegment.
Specifying the domainof the construct, generating Only the reliability diagonal (1) correspondingto
items that exhaust the domain, and subsequently the Likert measureis shown; data were not collected
purifyingthe resultingscale shouldproducea measure for the thermometerscale because it was not of interest
which is content or face valid and reliable. It may itself. The entries reflect the reliability of alternate
or may not produce a measure which has construct forms administered two weeks apart. If these are
validity. Construct validity, which lies at the very unavailable,coefficient alpha can be used.
heart of the scientific process, is most directly related Evidenceaboutthe convergentvalidityof a measure
to the question of what the instrument is in fact is provided in the validity diagonal(3) by the extent
measuring-what construct,trait,or concept underlies to which the correlations are significantly different
a person's performanceor score on a measure. from zero and sufficiently large to encourage further
The preceding steps should produce an internally examinationof validity. The validity coefficients in
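The internal consistency logic described above lends itself to a short computational sketch. The following Python fragment is not part of the original article and uses invented item scores; it computes coefficient alpha (Cronbach, 1951) and item-to-total correlations for a five-item scale in which one item runs against the common core. Alpha comes out low, the aberrant item shows the weakest item-to-total correlation, and alpha recovers once that item is dropped, which is exactly the purification cycle described in the text.

```python
# Sketch of coefficient alpha and item-to-total correlations for scale
# purification.  The item scores below are hypothetical illustrations,
# not data from the article.
from statistics import mean, pvariance

def coefficient_alpha(items):
    """items: list of k lists, each holding one item's scores."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]       # total score per respondent
    item_var = sum(pvariance(it) for it in items)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Five hypothetical Likert items scored 1-5 for six respondents.
items = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [3, 5, 2, 4, 1, 5],
    [5, 4, 3, 4, 2, 5],
    [2, 1, 4, 2, 5, 1],   # item that runs against the common core
]

alpha = coefficient_alpha(items)                       # ~0.03: alarmingly low
totals = [sum(resp) for resp in zip(*items)]
item_total = [pearson(it, totals) for it in items]     # fifth item is strongly negative

alpha_purified = coefficient_alpha(items[:4])          # ~0.94 after dropping the flagged item
```

The low item-to-total correlation singles out the fifth item for elimination, and recomputing alpha on the purified set shows the common core the remaining items share.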

Table 1

                                 Method 1--Likert Scale                 Method 2--Thermometer Scale
                           Job            Role       Role         Job            Role       Role
                           Satisfaction   Conflict   Ambiguity    Satisfaction   Conflict   Ambiguity

Method 1--    Job Satisfaction
Likert        Role Conflict
Scale         Role Ambiguity

Method 2--    Job Satisfaction    .450
Thermometer   Role Conflict      -.239      .395
Scale         Role Ambiguity     -.252      .141       .464

[Entries in the Method 1 (Likert) block, including the reliability diagonal, and the entries above the validity diagonal are not legible in this reproduction.]
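Campbell and Fiske's minimum comparison, that each validity coefficient should exceed every other correlation in its row and column of the heteromethod block, can also be sketched computationally. In the following Python fragment, which is not part of the original article, the validity diagonal uses the coefficients reported in the text (.450, .395, .464) along with the legible heteromethod entries; the values above the diagonal are invented for illustration because they are not legible in this reproduction.

```python
# Sketch of the minimum Campbell-Fiske check for a heteromethod block.
# Rows are the thermometer measures, columns the Likert measures, for
# the traits (job satisfaction, role conflict, role ambiguity).  The
# diagonal holds the validity coefficients; entries above the diagonal
# (-.082, -.054, .120) are hypothetical placeholders.
hetero = [
    [ .450, -.082, -.054],
    [-.239,  .395,  .120],
    [-.252,  .141,  .464],
]

def validity_exceeds_row_and_column(block):
    """True when each diagonal (validity) entry is larger than every
    other correlation sharing its row or column."""
    n = len(block)
    return all(
        block[i][i] > block[i][j] and block[i][i] > block[j][i]
        for i in range(n) for j in range(n) if j != i
    )

ok = validity_exceeds_row_and_column(hetero)   # True for these entries
```

Each validity coefficient dominates its row and column here, so this block passes the minimum requirement; a block in which some off-diagonal correlation matched or exceeded a validity coefficient would fail it.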
The validity coefficients in Table 1 of .450, .395, and .464 are all significant at the .01 level.

Discriminant validity, however, suggests three comparisons, namely that:

1. Entries in the validity diagonal (3) should be higher than the correlations that occupy the same row and column in the heteromethod block (4). This is a minimum requirement, as it simply means that the correlation between two different measures of the same variable should be higher than the correlations "between that variable and any other variable which has neither trait nor method in common" (Campbell and Fiske, 1959, p. 82). The entries in Table 1 satisfy this condition.
2. The validity coefficients (3) should be higher than the correlations in the heterotrait-monomethod triangles (2), which suggests that the correlation within a trait measured by different methods must be higher than the correlations between traits which have method in common. It is a more stringent requirement than that involved in the heteromethod comparisons of step 1, as the off-diagonal elements in the monomethod blocks may be high because of method variance. The evidence in Table 1 is consistent with this requirement.
3. The pattern of correlations should be the same in all of the heterotrait triangles, e.g., both (2) and (4). This requirement is a check on the significance of the traits when compared to the methods and can be achieved by rank ordering the correlation coefficients in each heterotrait triangle; though a visual inspection often suffices, a rank order correlation coefficient such as the coefficient of concordance can be computed if there are a great many comparisons.

The last requirement is generally, though not completely, satisfied by the data in Table 1. Within each heterotrait triangle, the pairwise correlations are consistent in sign. Further, when the correlations within each heterotrait triangle are ranked from largest positive to largest negative, the same order emerges except for the lower left triangle in the heteromethod block. Here the correlation between job satisfaction and role ambiguity is higher, i.e., less negative, than that between job satisfaction and role conflict, whereas the opposite was true in the other three heterotrait triangles (see Ford et al., 1975, p. 107, as to why this single violation of the desired pattern may not represent a serious distortion in the measure).

Ideally, the methods and traits generating the multitrait-multimethod matrix should be as independent as possible (Campbell and Fiske, 1959, p. 103). Sometimes the nature of the trait rules out the opportunity for measuring it by different methods, thus introducing the possibility of method variance. When this situation arises, the researcher's efforts should be directed to obtaining as much diversity as possible in terms of data sources and scoring procedures. If the traits are not independent, the monomethod correlations will be large and the heteromethod correlations between traits will also be substantial, and the evidence about

the discriminant validity of the measure will not be as easily established as when they are independent. Thus, Campbell and Fiske (1959, p. 103) suggest that it is preferable to include at least two sets of independent traits in the matrix.

Does the Measure Behave as Expected?

Internal consistency is a necessary but insufficient condition for construct validity. The observables may all relate to the same construct, but that does not prove that they relate to the specific construct that motivated the research in the first place. A suggested final step is to show that the measure behaves as expected in relation to other constructs. Thus one often tries to assess whether the scale score can differentiate the positions of "known groups" or whether the scale correctly predicts some criterion measure (criterion validity). Does a salesperson's job satisfaction, as measured by the scale, for example, relate to the individual's likelihood of quitting? It should, according to what is known about dissatisfied employees; if it does not, then one might question the quality of the measure of salesperson job satisfaction. Note, though, there is circular logic in the foregoing argument. The argument rests on four separate propositions (Nunnally, 1967, p. 93):

1. The constructs job satisfaction (A) and likelihood of quitting (B) are related.
2. The scale X provides a measure of A.
3. Y provides a measure of B.
4. X and Y correlate positively.

Only the fourth proposition is directly examined with empirical data. To establish that X truly measures A, one must assume that propositions 1 and 3 are correct. One must have a good measure for B, and the theory relating A and B must be true. Thus, the analyst tries to establish the construct validity of a measure by relating it to a number of other constructs and not simply one. Further, one also tries to use those theories and hypotheses which have been sufficiently well scrutinized to inspire confidence in their probable truth. Thus, job satisfaction would not be related to job performance because there is much disagreement about the relationship between these constructs (Schwab and Cummings, 1970).

DEVELOPING NORMS

Typically, a raw score on a measuring instrument used in a marketing investigation is not particularly informative about the position of a given object on the characteristic being measured because the units in which the scale is expressed are unfamiliar. For example, what does a score of 350 on a 100-item Likert scale with 1-5 scoring imply about a salesperson's job satisfaction? One would probably be tempted to conclude that because the neutral position is 3, a 350 score with 100 statements implies slightly positive attitude or satisfaction. The analyst should be cautious in making such an interpretation, though. Suppose the 350 score represents the highest score ever achieved on this instrument. Suppose it represents the lowest score. Clearly there is a difference.

A better way of assessing the position of the individual on the characteristic is to compare the person's score with the score achieved by other people. The technical name for this process is "developing norms," although it is something everyone does implicitly every day. Thus, by saying a person "sure is tall," one is saying the individual is much taller than others encountered previously. Each person has a mental standard of what is average, and classifies people as tall or short on the basis of how they compare with this mental standard.

In psychological measurement, such processes are formalized by making the implicit standards explicit. More particularly, meaning is imputed to a specific score in unfamiliar units by comparing it with the total distribution of scores, and this distribution is summarized by calculating a mean and standard deviation as well as other statistics such as the centile rank of any particular score (see Ghiselli, 1964, pp. 37-102, for a particularly lucid and compelling argument about the need and the procedures for norm development).

Norm quality is a function of both the number of cases on which the average is based and their representativeness. The larger the number of cases, the more stable will be the norms and the more definitive will be the conclusions that can be drawn, if the sample is representative of the total group the norms are to represent. Often it proves necessary to develop distinct norms for separate groups, e.g., by sex or by occupation. The need for such norms is particularly common in basic research, although it sometimes arises in applied marketing research as well.

Note that norms need not be developed if one wants only to compare salespersons i and j to determine who is more satisfied, or to determine how a particular individual's satisfaction has changed over time. For these comparisons, all one needs to do is compare the raw scores.

SUMMARY AND CONCLUSIONS

The purpose of this article is to outline a procedure which can be followed to develop better measures of marketing variables. The framework represents an attempt to unify and bring together in one place the scattered bits of information on how one goes about developing improved measures and how one assesses the quality of the measures that have been advanced.

Marketers certainly need to pay more attention to measure development. Many measures with which marketers now work are woefully inadequate, as the many literature reviews suggest. Despite the time and dollar costs associated with following the process suggested here, the payoffs with respect to the generation of a core body of knowledge are substantial. As Torgerson (1958) suggests in discussing the ordering of the various sciences along a theoretical-correlational continuum (p. 2):

    It is more than a mere coincidence that the sciences would order themselves in largely the same way if they were classified on the basis of the degree to which satisfactory measurement of their important variables has been achieved. The development of a theoretical science . . . would seem to be virtually impossible unless its variables can be measured adequately.

Progress in the development of marketing as a science certainly will depend on the measures marketers develop to estimate the variables of interest to them (Bartels, 1951; Buzzell, 1963; Converse, 1945; Hunt, 1976).

Persons doing research of a fundamental nature are well advised to execute the whole process suggested here. As scientists, marketers should be willing to make this commitment to "quality research." Those doing applied research perhaps cannot "afford" the execution of each and every stage, although many of their conclusions are then likely to be nonsense, one-time relationships. Though the point could be argued at length, researchers doing applied work and practitioners could at least be expected to complete the process through step 4. The execution of steps 1-4 can be accomplished with one-time, cross-sectional data and will at least indicate whether one or more isolatable traits are being captured by the measures as well as the quality with which these traits are being assessed. At a minimum the execution of steps 1-4 should reduce the prevalent tendency to apply extremely sophisticated analysis to faulty data and thereby execute still another GIGO routine. And once steps 1-4 are done, data collected with each application of the measuring instrument can provide more and more evidence related to the other steps. As Ray points out in the introduction to this issue, marketing researchers are already collecting data relevant to steps 5-8. They just need to plan data collection and analysis more carefully to contribute to improved marketing measures.

REFERENCES

Bartels, Robert. "Can Marketing Be a Science?," Journal of Marketing, 15 (January 1951), 319-28.
Bohrnstedt, George W. "Reliability and Validity Assessment in Attitude Measurement," in Gene F. Summers, ed., Attitude Measurement. Chicago: Rand McNally and Company, 1970, 80-99.
Buzzell, Robert D. "Is Marketing a Science?," Harvard Business Review, 41 (January-February 1963), 32-48.
Calder, Bobby J. "Focus Groups and the Nature of Qualitative Marketing Research," Journal of Marketing Research, 14 (August 1977), 353-64.
Campbell, Donald T. and Donald W. Fiske. "Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix," Psychological Bulletin, 56 (1959), 81-105.
Campbell, John P. "Psychometric Theory," in Marvin D. Dunnette, ed., Handbook of Industrial and Organizational Psychology. Chicago: Rand McNally, Inc., 1976, 185-222.
Churchill, Gilbert A., Jr., Neil M. Ford, and Orville C. Walker, Jr. "Measuring the Satisfaction of Industrial Salesmen," Journal of Marketing Research, 11 (August 1974), 254-60.
Converse, Paul D. "The Development of a Science in Marketing," Journal of Marketing, 10 (July 1945), 14-23.
Cronbach, L. J. "Coefficient Alpha and the Internal Structure of Tests," Psychometrika, 16 (1951), 297-334.
Czepiel, John A., Larry J. Rosenberg, and Adebayo Akerele. "Perspectives on Consumer Satisfaction," in Ronald C. Curhan, ed., 1974 Combined Proceedings. Chicago: American Marketing Association, 1974, 119-23.
Flanagan, J. "The Critical Incident Technique," Psychological Bulletin, 51 (1954), 327-58.
Ford, Neil M., Orville C. Walker, Jr., and Gilbert A. Churchill, Jr. "Expectation-Specific Measures of the Intersender Conflict and Role Ambiguity Experienced by Industrial Salesmen," Journal of Business Research, 3 (April 1975), 95-112.
Gardner, Burleigh B. "Attitude Research Lacks System to Help It Make Sense," Marketing News, 11 (May 5, 1978), 1+.
Ghiselli, Edwin E. Theory of Psychological Measurement. New York: McGraw-Hill Book Company, 1964.
Heeler, Roger M. and Michael L. Ray. "Measure Validation in Marketing," Journal of Marketing Research, 9 (November 1972), 361-70.
Howard, John A. and Jagdish N. Sheth. The Theory of Buyer Behavior. New York: John Wiley & Sons, Inc., 1969.
Hunt, Shelby D. "The Nature and Scope of Marketing," Journal of Marketing, 40 (July 1976), 17-28.
Jacoby, Jacob. "Consumer Research: A State of the Art Review," Journal of Marketing, 42 (April 1978), 87-96.
Kerlinger, Fred N. Foundations of Behavioral Research, 2nd ed. New York: Holt, Rinehart and Winston, Inc., 1973.
Kollat, David T., James F. Engel, and Roger D. Blackwell. "Current Problems in Consumer Behavior Research," Journal of Marketing Research, 7 (August 1970), 327-32.
Ley, Philip. Quantitative Aspects of Psychological Assessment. London: Gerald Duckworth and Company, Ltd., 1972.
Nunnally, Jum C. Psychometric Theory. New York: McGraw-Hill Book Company, 1967.
Schwab, D. P. and L. L. Cummings. "Theories of Performance and Satisfaction: A Review," Industrial Relations, 9 (1970), 408-30.
Selltiz, Claire, Lawrence S. Wrightsman, and Stuart W. Cook. Research Methods in Social Relations, 3rd ed. New York: Holt, Rinehart and Winston, 1976.
Torgerson, Warren S. Theory and Methods of Scaling. New York: John Wiley & Sons, Inc., 1958.
Tryon, Robert C. "Reliability and Behavior Domain Validity: Reformulation and Historical Critique," Psychological Bulletin, 54 (May 1957), 229-49.