

Academy of Management Review, 1985, Vol. 10, No. 2, 192-205.

An Evaluation of the Validity of Correlational Research Conducted in Organizations

TERENCE R. MITCHELL
University of Washington
Much of the research conducted in organizational settings uses correlational techniques to infer associations among variables of interest. This paper reviews some of the problems with this type of research and develops a checklist for judging the validity of the research methods and designs employed. A sample of published correlational studies is evaluated, using the checklist. In many cases these studies were found to have serious design problems. Potential remedies and responses to these problems are discussed.


A few years ago the Executive Committee of Division 14 (Industrial and Organizational Psychology) of the American Psychological Association initiated a project designed to evaluate current methods for doing research on organizationally related topics, with the goal of encouraging new and innovative procedures. Six monographs were produced as a result of these efforts, and much of what is discussed in this paper relates to issues discussed in these volumes. A passage in one of those monographs prompted the thinking and research discussed in this paper. A questionnaire was sent to a random sample of 220 Division 14 members. In response to a question about the type of research that should be done on organizational topics, one anonymous respondent wrote:

The addition of a so-called organizational concept seems to have opened the door to sloppy, ill-considered, or even non-existent research methodology. A whole body of sound research concepts seems to have been suppressed; e.g., reliability of measurement, power of statistical tests, adequate and well-thought-through sampling designs - unfortunately, a not uncommon phenomenon in the literature today is a "study" in which the authors sent a nonpretested, nonscaled questionnaire to a convenient sample of uncertain nature in which little or no thought was given to the reliability of the measurement or the meaningfulness of responses. Nonetheless, numbers are obtained and are subjected to the staggering array of sophisticated computer-assisted statistical techniques and "results" are obtained and generalized to a population of which the initial sample had very little relationship (Campbell, Daft, & Hulin, 1982, p. 61).

The focus of this passage seems to be the large body of survey/correlational research that appears in the organizational literature. Such research is done in the field, involves no manipulations, is cross-sectional, and makes associational inferences. The current paper reports on an attempt to evaluate the anonymous respondent's contention. A checklist of potential problems of this type of correlational research is developed, and a review of recent published work is conducted using this checklist. The implications of the findings are discussed, with a focus on the changes, if any, that seem appropriate for the research procedures typically used in cross-sectional correlational research.

Background Literature
The distinction between "experimental" and "correlational" research was highlighted in a classic article by Cronbach (1957). Cronbach pointed
out that these two methods used different types of samples, measures, analyses, and inferences. It was his hope that a rapprochement would evolve so that both treatment differences and individual differences could be integrated in theories and research. Whether that rapprochement has occurred is still debatable and a topic for another paper. What is clear and obvious, however, is the extent to which methodological expertise has focused on one set of procedures over the other. Perhaps the fact that they were dealing with "causal" issues made experimental topics seem more crucial. Or perhaps the clarity and insight of this tradition was easier to put on paper. In any case, sophistication in judging the validity of experimental research designs seems greater than for correlational research designs. However, some important standards for correlational research may be inferred by an examination of the literature on experimental research.

Campbell and Stanley's (1963) paper was a milestone in the experimental tradition. It provided the concepts of internal and external validity and described the procedures required to evaluate them. The discussion of threats to the internal validity of experimental designs outlined in that paper clarifies the issues of causal inference. Campbell and Stanley described three necessary conditions for causal inference: (1) the cause precedes the effect in time, (2) the cause and effect covary, and (3) there is no plausible alternative explanation for the covariation. Their checklist of confounds focused mostly on the third condition, and they described how the design and analyses could confirm the first two conditions.

Cook and Campbell (1976, 1979) broadened this conceptualization in a number of ways. First, quasi-experimental designs received a more detailed treatment, and second, the number of validity concepts expanded to four. The basic focus of these efforts was still causal inference and the language of experimentation. But by thinking about the types of validity required to evaluate an experiment, Cook and Campbell developed a set of concepts that also were helpful for evaluating correlational research. These four concepts - internal validity, construct validity, statistical conclusion validity, and external validity - serve
as the basis for the checklist developed in the current paper.

Internal Validity

Internal validity deals with the question of whether the experimental treatment (variable A) did indeed have an effect (on variable B). When there is a lack of internal validity, there is a possible alternative explanation for the relationship between A (the cause) and B (the effect). But, as Cook and Campbell (1976) point out, there are two ways this may occur. First, there is the situation in which a spurious event can be used as a plausible explanation. For example, in a goal setting study in which an experimenter assigns a hard goal to the experimental groups, more people might drop out in the experimental groups than in the control groups, which have an easy goal. The people who dropped out could easily be those who were marginally motivated. Thus, a difference between the hard goal groups and easy goal groups might show that the hard goal groups performed better. Obviously, subject mortality could be confounded with the treatment and be the cause of the observed difference. The Campbell and Stanley list of threats to internal validity (e.g., history, testing, maturation, selection, experimenter effects) was designed to deal with this type of problem. The Cook and Campbell (1976) list covers additional items having to do with relationships between the experimental and control groups (e.g., sharing information, resentment at being in the less desirable treatment). But Cook and Campbell also point out a second way in which there can be an improper inference due to a third variable. Here, however, the variable is not spurious but is a potential substitute for the concept A or B. They label this problem as one of construct validity.

Construct Validity

Construct validity, according to Cook and Campbell, is what we mean by confounding - "the possibility that the operational definition of a cause or effect can be construed in terms of more than one construct, all of which are stated at the same level of reduction" (1976, p. 238). The confound here is not a spurious one but a labeling one. That is, the third variable at the theoretical level is seen as being potentially similar or interchangeable with the cause or effect. Lack of construct validity appears empirically as contamination (variance in the measure that is not present in the construct) and/or deficiency (variance in the construct that is not captured by the measure) (Schwab, 1980). This type of validity problem is particularly relevant for correlational research, in which the construct validity of the measures is infrequently tested (Drasgow & Miller, 1982; Schwab, 1980).

Construct validity, of course, is only one of a set of ways in which measures can be validated. Predictive and concurrent validity also provide information about the measures, and discussions about how this validity can be assessed are available (Cronbach & Meehl, 1955; Guion & Cranny, 1982). Most authors agree, however, that construct validity is a broader idea and that in some sense it includes predictive and concurrent validity. Also, though predictive and concurrent validity may be helpful in some practical applications, without construct validity they may not further the understanding of why certain relationships occur. To gain this knowledge construct validity is needed. So, to rule out alternative hypotheses, one must check for both design flaws leading to a lack of internal validity and conceptual/measurement flaws that result in a lack of construct validity.

Statistical Conclusion Validity

This type of validity is discussed by Cook and Campbell (1976) as an additional threat to internal validity. It refers to the idea of "instability," which includes the unreliability of measures, fluctuations in sampling of persons or measures, and the instability of repeated measures. Just as lack of construct validity is caused by what is called "constant error" (Cronbach & Meehl, 1955) or what Cook and Campbell (1976) call systematic bias, testing for statistical conclusion validity is concerned with sources of random error variance and with the appropriate use of statistics and statistical tests. Unreliable measures or inappropriate tests can lead to incorrect inferences. Again, one should note that concerns related to this type of validity - especially the reliability and stability of measurement procedures - are particularly relevant for correlational research.

External Validity
External validity (Cook & Campbell, 1976; Campbell & Stanley, 1963) reflects the extent to which the inference drawn from any particular experiment can be generalized to or across times, settings, and persons. There are two ways this can be done. One can include multiple measures taken across multiple occasions and settings with different groups of people in a single study (a major undertaking). Or one can conduct the type of meta-analysis suggested by Hunter, Schmidt, and Jackson (1982) on an existing body of literature. Whichever approach is used, it must be recognized that correlational field studies are in no way inherently more externally valid than experimental lab studies (Berkowitz & Donnerstein, 1982; Dipboye & Flanagan, 1979, 1980; Hunter et al., 1982; Willems & Howard, 1980).

In summary, Cook and Campbell's (1976, 1979) extensions provide a more detailed conceptualization of the validity of experiments. Their concepts (and the refinements suggested by those cited below) can be very helpful in attempting to develop a list of potential confounds for correlational research designs. Other papers, such as Brinberg and McGrath's (1982) recent paper on validity, present an even more elegant and comprehensive system for describing validity concepts. A paper by Maher (1978) presents a comprehensive checklist for reviewing research in clinical psychology, including items covering style, figures and tables, and format. However, because Brinberg and McGrath's system and Maher's checklist are somewhat broader than what is needed here and would demand the introduction and definition of a new set of terms, Cook and Campbell's distinctions are used for the current paper. Basically, one asks two questions of either an experimental or correlational result: (1) Is this result a true representation of the relationship between construct A and construct B in this particular study? and (2) Is this relationship generalizable to different populations, measures, and circumstances? Cook and Campbell (1976) added statistical conclusion validity and construct validity, along with the traditional idea of internal validity, as design confounds. All of these relate to the first question. And, as shall be seen later, the debate over external validity (the second question) also has highlighted some potential problems with both experimental and correlational research designs.

Validity and Experimental Research


Both McGrath (1982) and Schwab (1980) have depicted the relationships between two variables, A and B, as shown in Figure 1. The relationships between A and B are theoretical statements - part of the theoretical network (Cronbach & Meehl, 1955). They link theoretical constructs to one another through a set of sentences describing the theory. The relationships between A and a or B and b are measurement validity relationships. They reflect the extent to which a particular measure reflects a particular construct. The relationship between a and b in an experimental study shows covariation between a manipulated variable and some other variable (labeled the independent and dependent variables, respectively). The relationship between a and b in a correlational study shows a linear association for measures a and b.

Figure 1. Measurement Validity Relationships. [The figure shows constructs A and B linked at the theoretical level, measures a and b linked at the empirical level, and vertical links from A to a and from B to b representing measurement validity.]
In a hypothetical experiment on goal setting, the hypothesis might be that specific goals result in higher productivity than do general goals. A proofreading task is used and materials are prepared with eight errors per page. The experimenter uses a sign-up sheet to obtain voluntary subjects from the pool of students in his or her class, and they are randomly assigned to the experimental (specific goal - find five errors a page) and control (general goal - do your best) conditions. Everyone participates, and a post-test questionnaire asks whether they had a goal; if so, what it was and their commitment to it. A t-test shows that significantly more errors were found by the experimental group subjects than by the control group subjects.
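As a purely illustrative sketch of the analysis just described, the code below simulates the two goal conditions and runs the independent-samples t-test; the group sizes, means, and variances are invented for the example and are not from any actual study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical errors found per page in each condition.
specific_goal = rng.normal(loc=5.0, scale=1.0, size=30)  # "find five errors a page"
do_your_best = rng.normal(loc=4.2, scale=1.0, size=30)   # general "do your best" goal

# Independent-samples t-test on mean errors found.
t, p = stats.ttest_ind(specific_goal, do_your_best)
print(f"t = {t:.2f}, p = {p:.4f}")
```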
If one evaluates this study in terms of the four types of validity, the internal validity probably would be examined first, and basically the study comes off rather well. Using the Campbell and Stanley checklist there is very little to criticize. One might inquire about demand characteristics, or subjects talking to one another across conditions, or experimenter bias. But perhaps steps were taken to assure that these problems did not occur (e.g., multiple experimenters were used who were blind to the study's purpose; subjects were run separately). In most cases, if the theoretical question tested was of substantial importance, adequate internal validity would be sufficient for a positive recommendation about publication.

The other three types of validity probably would receive little attention. The manipulation check would suggest whether there was a difference between the experimental group and control group in the perceived construct for the goal setting variable. This evidence would support claims for construct validity. One might argue that specific goal setting increases evaluation apprehension and that the results were due to this, not a specific versus general goal setting effect at all. That is, because the students knew the teacher and were in the teacher's class, evaluation apprehension might have been higher with the specific goal than the general goal. This interpretation might be serious enough to cause problems - or the experimenter might have corrected for it. In any case, construct validity of this type is clearly perceived as being of secondary importance when evaluating this type of research. Statistical conclusion validity usually gets even less attention. The idea of a reliable manipulation is almost never discussed, and neither is the reliability of the criterion unless it is measured over repeated trials (and even then the reliability seldom is reported). Finally, external validity is essentially not used as a standard. Berkowitz and Donnerstein's work (1982) notwithstanding, most laboratory experiments make little claim to external validity. In fact, most authors place it under the "further research" heading.

Although the description of the above study is somewhat superficial, it demonstrates the priorities that are often used in evaluating experimental research. Internal validity is most important, and issues of sample selection, measurement reliability, construct validity, and generalizability normally are of less concern. The priorities are dramatically different when evaluating correlational studies.

Validity and Correlational Research


Again, as a hypothetical example, a researcher conducts a cross-sectional survey designed to examine the relationships suggested in House's (1977) thorough and interesting theoretical development of the charisma concept. First, a questionnaire is constructed to measure: (1) the characteristics of charismatic leaders (e.g., confident, dominant, purposeful); (2) the behavior of charismatic leaders (e.g., goal articulation, image building); and (3) the reactions of followers (e.g., devotion, unquestioning support, radical change). The reactions by followers indicate the degree to which the subordinates feel they are being led by a charismatic leader and thus serve as the criterion.

A large local retail firm agrees to participate in the study. All the subordinates of "middle level" managers are sent questionnaires that assess their perceptions of their managers' characteristics and behaviors, as well as the respondents' self-reports about their behavioral and attitudinal reactions. The questionnaires are filled out anonymously, although respondents do indicate who their manager is. The forms are returned by mail in envelopes provided for each individual. The response rate is 62 percent. Each specific measure (e.g., confidence) is composed of five Likert-type scales, and the coefficient alpha levels are all above .70. (See Cronbach, 1951, for a discussion of alpha.) The results show that each of the leader characteristics and leader behaviors correlates significantly and positively with each of the criteria. Also, multiple regression analyses predicting compliance from the leader characteristics and behaviors result in average R2s of about .40.

In summary, there is a good theoretical base, a large response rate, high internal consistency, and significant results. Research conducted in this manner has appeared in the journals. But a look at the types of validity described so far will help judge the adequacy of the study.
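For readers who want the alpha computation spelled out, here is a minimal sketch for one hypothetical five-item Likert scale; the data, the scale name, and all numbers are invented for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents x n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the scale totals
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical responses: 200 respondents x 5 items for a "confidence" scale.
rng = np.random.default_rng(1)
true_score = rng.normal(size=(200, 1))
responses = np.clip(np.rint(3 + true_score + rng.normal(scale=0.8, size=(200, 5))), 1, 5)

print(f"alpha = {cronbach_alpha(responses):.2f}")  # internal consistency only
```

Note that, as the paper argues below, a high alpha speaks only to internal consistency, not to stability over time or observers.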
Internal validity, as defined earlier, has to do with some "spurious" event being related to the treatment (a) or dependent variable (b). In a correlational study one is asking whether some spurious situational event or an unexpected, not conceptually similar third variable may be the reason for a particular correlation. The traditional list of confounds suggested by Campbell and Stanley (1963) is of only marginal help here. Because there are no experimental and control groups, issues such as history, selection, maturation, testing, and regression have less relevance for judging the validity of the results. The most salient threat (not a treatment effect) is the "third variable" that may be correlated with X or Y or both, but is not a conceptual replacement for X or Y. One is reminded that a positive correlation between the number of churches and liquor stores in cities was explained by the size of the city - bigger cities have more churches and liquor stores than smaller cities. Conceptually, churches, liquor stores, and city sizes are relatively distinct concepts.

The crucial issue here, however, is the word "relatively." A problem with the "third variable" interpretation is that the interchangeability of constructs often is a matter of degree. For example, one might argue that age was the underlying cause of the confident-compliant coefficient found in the charisma study. The younger the subordinate, the more compliant he or she might be; the older the manager, the more confident he or she might appear. One might argue that experience could function the same way. Whether age or experience is defined as interchangeable with confident and compliant is a matter of the theoretical definition of the constructs. If they were part of such definitions, the issue then becomes one of construct validity. If not, then it is an internal validity problem. In either case, systematic thinking and measuring should be done to check for alternative interpretations. One should actively try to conceptualize and measure those variables that may serve as potential confounds. This type of check should be part of the list of checks for the validity of correlational research.
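To make such a third-variable check concrete, the sketch below simulates a hypothetical confound (here labeled "tenure") driving both measures, and shows how partialling it out changes the zero-order correlation; the variable names and effect sizes are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)  # residual of x on z
    ry = y - np.polyval(np.polyfit(z, y, 1), z)  # residual of y on z
    return stats.pearsonr(rx, ry)[0]

rng = np.random.default_rng(2)
n = 300
tenure = rng.normal(10, 3, n)                   # hypothetical third variable
confidence = 0.5 * tenure + rng.normal(size=n)  # both measures driven by tenure
compliance = 0.5 * tenure + rng.normal(size=n)

r_zero = stats.pearsonr(confidence, compliance)[0]
r_partial = partial_corr(confidence, compliance, tenure)
print(f"zero-order r = {r_zero:.2f}, partial r = {r_partial:.2f}")
```

In this simulation the sizable zero-order correlation essentially vanishes once the confound is controlled, which is precisely the pattern such a check is designed to reveal.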

Statistical conclusion validity, as mentioned before, has to do with stability, reliability, or any other factor that might affect the statistical analysis in such a way that one makes an incorrect inference (i.e., a Type I or Type II error). There are four major threats to this type of validity. First is the reliability of the measures. In the hypothetical example on charisma, the authors provided measures of internal consistency that were fairly high. Is this adequate evidence of reliability? Conceptually, the answer has to be no. Reliability can be assessed as the variation across time, measures, settings, or observers (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). The demonstration of internal consistency is only one small part of the concept of reliability, and it is not necessarily the most important. Clearly, correlations between variables that are internally consistent but unstable over time or observers could lead to incorrect inferences. So another item for the list is whether the reliability of the measures is reported and, if so, what type of reliability.

A second threat to statistical conclusion validity has to do with the range of values taken by the variables. It is very likely that there is some difference between respondents and nonrespondents that systematically influences the distribution of scores. Perhaps only the very compliant respond - which obviously would affect the range of values obtained on the "compliance" scale. Or perhaps not-so-busy people respond and busy ones do not. Although slightly less obvious, how busy one is may affect the distribution of scores for supervisor confidence or dominance or subordinate compliance. The list is potentially endless. Solving this problem requires thinking about it in advance. The investigator must analyze beforehand the sorts of variables that might be potential confounds and attempt to demonstrate that respondents do not differ from nonrespondents on these variables.

The third threat to statistical conclusion validity comes from the nature of regression analyses. Regression coefficients are notoriously subject to shrinkage; and the less reliable the measures, the greater that shrinkage is likely to be. So another check is whether there is any attempt at cross-validation with a separate sample or the use of a holdout sample. A fourth problem is simply the number of tests. If one has a large correlation matrix, obviously some variables will be significant by chance. This problem usually is handled by most researchers, but it still should be checked.
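A bare-bones illustration of shrinkage and a holdout check, using simulated data (the sample size, number of predictors, and effect are invented; only one of ten predictors actually matters):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 10                            # modest sample, many predictors
X = rng.normal(size=(n, k))
y = 0.3 * X[:, 0] + rng.normal(size=n)    # only the first predictor is real

half = n // 2
Xd, yd = X[:half], y[:half]               # derivation sample
Xh, yh = X[half:], y[half:]               # holdout sample

# Ordinary least squares fit on the derivation sample (with intercept).
Xd1 = np.column_stack([np.ones(half), Xd])
beta, *_ = np.linalg.lstsq(Xd1, yd, rcond=None)

def r2(X, y, beta):
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    ss_res = ((y - pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

print(f"derivation R2 = {r2(Xd, yd, beta):.2f}")  # optimistic
print(f"holdout    R2 = {r2(Xh, yh, beta):.2f}")  # shrunken estimate
```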

In summarizing the evaluation of statistical conclusion validity, the hypothetical study used only alpha, had no check on respondents versus nonrespondents, and had no holdout samples or cross-validation. There were, however, many more significant coefficients than would be predicted by chance.

Construct validity deals with the meaning of the constructs and, as Schwab (1980) has pointed out, research efforts are sorely lacking in this area. Using a broad definition of construct validity, one must look at the extent to which the respondents represent the group from which they are sampled as well as whether the measures represent the constructs. The potential problems in this area are numerous. First, to the extent that the sample is not random or respondents differ from nonrespondents, there may be a construct validity problem. The respondents simply do not represent the initial sample. From a theoretical point of view, construct validity problems also can occur when a third variable is seen as being conceptually similar and equally likely to be an explanation of the discovered relationship. As discussed before, this interchangeability idea is a matter of degree, but in the area of construct validity these variables definitely should have been thought of by the experimenter before the research was conducted. Such "interchangeable" constructs should be assessed and appropriate tests for construct validity applied.

There are a number of ways one can demonstrate construct validity (or a lack of it). The works of Schwab (1980), Cronbach and Meehl (1955), and Cook and Campbell (1976) describe the procedures in detail. The one thing that all the authors agree on is that there are a number of major problems and some clearly defined ways to check for the problems. Perhaps the most damaging event in the type of research being discussed is method variance. As Fiske said recently:

Method variance is pervasive, ubiquitous. Almost invariably in social and behavioral science, each array of measurements from a construct-method unit contains variance associated with the method. Any obtained relationship between two such units can be due to method variance shared by both (1982, p. 82).

Fiske goes on to suggest that method constitutes not only questions of format but also item content, general instructions, and other features of the test-task as a whole, along with characteristics of the examiner and the reason the subject is there in the first place. Similarly, halo error, response bias, and "looking good" or social desirability type problems fall under Fiske's broad definition of method variance.

What sorts of checks are available for these problems? At the very least, one should cite previous studies that report the different types of measurement validity, such as predictive validity (e.g., previous test-performance relationships) or construct validity (e.g., agreement among multiple measures of the same construct). If thorough work has been done previously, then much concern may be put to rest. However, few of the constructs in organizational studies have been analyzed by, for example, the multitrait-multimethod analysis of validity suggested by Campbell and Fiske (1959). And it would be hard to argue that each study do that. On the other hand, if previous work on discriminant and convergent validity has not been done on each of the constructs, there may be a problem. In some cases constructs may have been demonstrated to have construct validity in separate studies. However, if multiple measures with similar formats are used in a particular study, there still can be a problem of method variance. In these cases it seems reasonable to look for (1) multiple measures of each construct, (2) multiple methods of measuring constructs, and (3) the measurement of multiple constructs. Also, if there is some reason to believe an "alternative" third variable is the explanation, such a third variable should be measured and the interrelationship reported. If alternative constructs may be a problem, corrective statistical procedures such as partial correlations can be used.
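To suggest what even a rudimentary convergent/discriminant check might look like - far short of a full Campbell and Fiske (1959) analysis - the sketch below simulates two constructs, each measured by a questionnaire (sharing a common method factor) and by an interview, and compares the relevant correlations; everything here is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
confid = rng.normal(size=n)  # latent "confidence"
domin = rng.normal(size=n)   # latent "dominance"
halo = rng.normal(size=n)    # shared questionnaire method factor

# Two methods per construct: questionnaire (with method factor) and interview.
confid_q = confid + 0.7 * halo + rng.normal(scale=0.5, size=n)
domin_q = domin + 0.7 * halo + rng.normal(scale=0.5, size=n)
confid_i = confid + rng.normal(scale=0.5, size=n)
domin_i = domin + rng.normal(scale=0.5, size=n)

r = np.corrcoef([confid_q, confid_i, domin_q, domin_i])
print(f"same trait, different method (convergent):  {r[0, 1]:.2f}")
print(f"different trait, same method (method var.): {r[0, 2]:.2f}")
print(f"different trait, different method:          {r[0, 3]:.2f}")
```

A healthy pattern has the first correlation clearly exceeding the second; when the second is sizable, shared method variance is a plausible rival explanation.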
In short, method variance is a serious problem, and in analyzing the hypothetical study one can see that there is a high potential for method variance and response biases (same measurement format, same respondents). A third variable, such as "expertise" or "liking" the supervisor, may explain both perceived confidence and compliance responses. The sample clearly is not random, and there is no check for differences between respondents or nonrespondents on variables that might be partially interchangeable with the constructs (e.g., only high authoritarian people respond - the more they have a dominant leader, the more they comply). No attempts to use multiple measures of constructs and no attempts to look at discriminant validity are made. Clearly, on the construct validity dimensions the described hypothetical study does not do well.

External validity is the ability to generalize a particular finding across different measures, settings, and populations. Because the sample was a convenience sample to start with, it is hard to know the group to which one could generalize; and because only one type of measurement procedure was used, the same problem occurs. Thus, in an area in which field surveys are supposed to be strong - external validity - they may, in fact, be just as weak as laboratory studies. Just such an argument was made by Dipboye and Flanagan with their analysis of field studies and lab studies. They conclude that "despite the problems with laboratory research, the findings indicate that there is no empirical basis for a belief in the inherent external validity of field research" (1979, p. 147).

Throughout the paper the original Campbell and Stanley concept of internal validity has been broadened. To determine if a correlational relationship has been found in which one can have confidence for a particular population, one must examine the internal validity, statistical conclusion validity, and construct validity of the study. The first-named seems to be viewed by most investigators as most important for experimental studies; the latter two are seen as relatively more important for correlational ones. External validity perhaps is inappropriately seen as inherent in field studies and lacking in laboratory studies.

Survey of Current Research


Sample

The Journal of Applied Psychology (Volumes 65-68), the Academy of Management Journal (Volumes 22-26), and Organizational Behavior and Human Performance (Volumes 23-31) were surveyed. These volumes covered the 1979-1983 time period. To be included in the sample, a study had to: (1) use questionnaires or interviews; (2) use an organizational sample; (3) use correlational analyses, which may or may not include partial correlations or multiple regression; and (4) not be a short note or methodological study (e.g., multitrait-multimethod analyses of validity). Longitudinal studies that employed cross-lag correlations also are omitted. The focus of the paper is on associational inferences in the traditional cross-sectional study.

Measures

The articles were reviewed using the questions shown in Table 1. Note that a couple of items from the earlier discussion were omitted from the survey. For example, it was difficult to judge whether people had thought about possible third variables. Also, the error of doing too many statistical tests is so seldom seen that it was not checked.

One point needs some further clarification. In most correlational studies there are multiple "independent" (as) and "dependent" (bs) variables. The as and bs most likely are different constructs, but they can be measured on scales with similar formats. This suggests that any significant correlation might be attributable to method variance. Thus, it is important to know whether the measures of a and b use similar formats. But the problem is more complex. Multiple as may be similar/different constructs measured with similar/different formats. The same is true for bs. This complexity presents a 2x2 table for both as and bs. At the most detailed level of specificity one would want to know if there were multiple types of as and bs and multiple measures with same/different formats of each, and then whether checks were made for construct validity, discriminant validity, and method variance. Such a task was simply too formidable and confusing because of all the possible combinations that could exist. Therefore, the following types of questions were used when dealing with issues of construct validity:
1. Were as and bs assessed using similar or different measurement formats?

2. Was there an attempt to show construct validity of a or b by the use of convergent validity - multiple measures of the same construct or factor analytic procedures? The formats of the measures could be similar or dissimilar.

3. Was there an attempt to measure multicollinearity due to similar formats for assessing as and bs that were different constructs?

Note that such information not only addresses the issues of method variance, but also discriminant validity. For example, in the study of charisma a lack of correlation between the constructs "confident," "dominant," and "purposeful" would demonstrate both discriminant validity and a lack of method variance. However, positive intercorrelations can be due to either a lack of construct validity or method variance.

Table 1
Information Gathered on Each Article (% interrater agreement for each item in parentheses)

1. Sample: random, cluster, stratified, convenience (100)
2. Response rate (100)
3. Were respondents compared to nonrespondents? (90)
4. Type of method used to gather data: personally administered questionnaire, mailed questionnaire, delivered (but not administered) questionnaire, personal interviews (100)
5. Type of reliability reported: alpha, test/retest, split half, interrater, none (90)
6. Was previous literature on reliability and validity cited? (100)
7. Were there any tests for convergent or discriminant validity? (100)
8. If #7 was yes, what procedure was used: multiple measures, multimethod, multitrait, factor analyses, discriminant validity? (100)
9. Was the assessment of A and B on measures with a similar format? (100)
10. Were intercorrelations of independent and dependent variables reported as evidence for a lack of method variance? (100)
11. If yes to #10, were the coefficients significant? (100)
12. Did the authors address the issue of method variance statistically (e.g., partial correlations, stepwise regression) or conceptually? (100)
13. Was a holdout sample or cross-validation present? (100)

In most cases the authors are interested in removing method variance from multiple predictors using similar measurement formats rather than demonstrating the discriminant validity of the constructs. In almost every instance in which intercorrelations of as and bs are presented (in most cases it is multiple as) the authors then use partial correlations or stepwise multiple regression to predict b. Although this does indeed help to reduce the multicollinearity among as, it also partials out the "true" overlap of constructs. Basically, without multiple measures of multiple constructs using multiple formats for both as and bs, these sources of variance cannot be separated. None of the studies reviewed used those procedures.

Finally, note the interrater agreement listed in Table 1 after each item. An independent judge was given 10 randomly selected articles, and the reported figures represent agreement with the present author. There was disagreement on one study on the respondents versus nonrespondents question. Although the study compared the two groups on the criterion (turnover), no comparisons were made for the predictors or "other" variables that might influence this relationship. One rater said "yes, there was a comparison of respondents versus nonrespondents," but the other said "no." There was lack of agreement on one study on the reliability item because the author reported two types of reliability (alpha and test-retest); one rater recorded the alpha, and the other recorded both.
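The percent-agreement computation itself is trivial; for concreteness, a sketch with hypothetical codings of ten articles on one checklist item (1 = "yes," 0 = "no"):

```python
import numpy as np

# Hypothetical codings by two independent raters; agreement is simply
# the proportion of articles on which the codes match.
rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])
print(f"% agreement = {100 * (rater_a == rater_b).mean():.0f}")  # 90 here
```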
Results

Exhibit 1 presents the findings. The data were broken down further by journal, but various analyses failed to show any large or meaningful differences. Therefore, the data are presented at the global level. First, in terms of sampling issues, the data paint a fairly bleak picture. Over 80 percent of the studies used a convenience sample, and only about 10 percent of the studies compared respondents with nonrespondents.

Exhibit 1
Overall Results

Cases: Total = 126 (JAP = 37; OBHP = 28; AMJ = 61)

1. Sample(a): 21 Yes; 105 No
2. Response rate: range = 30-94%; not available for 67 studies
3. Respondents vs. nonrespondents compared: 11 Yes; 115 No
4. Type of method(b): AQ = 58; MQ = 34; DQ = 21; PI = 19 (6 studies used more than one method)
5. Reliability(c): alpha = 88; TR = 7; IR = 10; SH = 6; NA = 26 (11 studies used more than one method)
6. Citations of reliability and/or validity: 100 Yes; 26 No
7. Construct validity tests: 33 Yes; 93 No
8. Type of construct validity: multiple measures = 19; factor analysis = 13; discriminant validity = 1
9. Did A & B measures use similar format: Same = 52; Different = 74
10. Test for method variance: 84 Yes; 42 No
11. Significant correlations: 65 Yes; 22 No; 39 NA(d)
12. Issue of method variance addressed: 81 Yes; 45 No
13. Holdout sample or cross-validation: 7 Yes; 119 No

JAP = Journal of Applied Psychology; OBHP = Organizational Behavior and Human Performance; AMJ = Academy of Management Journal.
(a) Yes = a good sample (random, clustered, or stratified); No = convenience sample.
(b) AQ = administered questionnaire; MQ = mailed questionnaire; DQ = delivered questionnaire; PI = personal interview.
(c) NA = not reported; TR = test/retest; IR = interrater; SH = split half.
(d) For a few of the articles (N = 3) for which no complete intercorrelation matrices were provided, some correlations showed method variance.

Of equal importance, over half of the studies did not even report a response rate. This occurs most of the time when questionnaires are administered personally or interviews are done. The researchers in most cases simply call their sample (those who show up) their population, and the concept of sampling loses its meaning altogether.

In terms of measurement procedures, questionnaires clearly dominate interviews. The reliability reported also is quite discouraging. Over half the studies used alpha, and only 17 studies (14 percent) reported test-retest or interrater reliabilities. No reliability at all was reported by 26 studies (21 percent). The citation information indicates that at least the researchers cited studies that had used the measures before. Only 26 of the 126 studies (21 percent) failed to give information about instrument reliability or validity from previous studies. However, because the data suggest that most studies rely on alpha as an indication of reliability, citations of past research probably were referring to alpha as the reliability estimate. Also, few studies reported either convergent or discriminant validity. Therefore, the past citations of measurement validity likely referred to face validity or concurrent or predictive validity. To some extent this represents a self-reinforcing cycle of dependence on inadequate measures.

The information on construct validity also is rather discouraging. Only 33 of the studies (26 percent) provided information on construct validity, and most of those used multiple measures (19) or factor analysis (13), which provided evidence for convergent validity. Only one study looked at discriminant validity specifically.

The tests for method variance were more prevalent, and this seems to be an issue to which most authors and reviewers are sensitive. First, 59 percent of the studies used as and bs that were assessed by measures using different formats. Second, 67 percent of the studies reported the intercorrelations among various "independent" and "dependent" variables. And in most cases in which there were significant correlations an appropriate statistical procedure was used to correct for it. That is, partial correlations or stepwise regressions were used to remove the multicollinearity.
Finally, only seven of the studies looked at holdout samples or used cross-validation procedures. Both of these procedures reflect on statistical conclusion validity, and the cross-validation can represent external validity as well.

Discussion
These data are rather unsettling. They suggest that the typical cross-sectional correlational study uses a convenience sample with an administered questionnaire. No response rate is reported, and no comparisons are made between respondents and nonrespondents. The literature on reliability and validity is often cited, but the actual reliability and validity demonstrated is very weak. Alpha predominates as a measure of reliability, and construct validity is infrequently checked. Researchers do seem to be sensitive to the dangers of method variance, but the use of cross-validation or holdout samples is almost nonexistent.

The reaction of some readers to the above conclusions will be "So what? We knew that already." However, the survey by Campbell et al. (1982) suggests that not everyone feels the same way. Their respondents indicated substantial disagreement on the issues listed in Table 2.

Table 2
Conflicting Positions Within the Division 14 Sample

Side One | Side Two
1a. We need broader, more generally applicable theory. | 1b. We need narrower, more detailed theories.
2a. Descriptive studies are good. We have very little knowledge of the behavior we are trying to research. | 2b. Descriptive studies are bad. They pile up uninterpretable data and do not lead anywhere.
3a. There is too much emphasis on measurement for measurement's sake. | 3b. There is too little emphasis on valid measurement. The field is replete with lousy unvalidated measures.
4a. We actually know quite a bit, and some questions have been substantially answered. | 4b. We have learned virtually nothing about organizational behavior.

These disagreements were described as "major" by the authors because numerous people took one side or the other on each point listed. Also, note that the results of the present research confirm the expectations of the people who have the opinions on the right side of the table (no surplus meaning intended). The important point is that there is a substantial number of people who might take an opposite position. But whether one finds these results surprising or not, other questions need to be discussed. First, one would want to know why this state of affairs exists. That is, why is this type of research being done? Second, and perhaps more important, is a question about how correlational research can be done better. That is, how can some of the problems raised by the checklist and the review of current research be remedied?

Causes of the Status Quo

One possible explanation for the findings is that many researchers do not know how to do good correlational research. To the extent that experimental research is emphasized in research methodology courses, students may indeed lack training in this area. Also, at the moment there is no single paper or book that comprehensively covers such topics to which students or researchers can turn. Though the checklist presented in this paper does not present anything new or unavailable in other sources, it does present all this information in one place. In this respect the checklist and the comments on possible remedies may serve an important function in the literature on correlational research.

Another set of explanations has to do with the practical issues of doing survey research in actual organizations. In many cases one is simply not able to employ the type of design he/she would like to use. That is, organizational participants simply will not agree to random sampling, test-retests, multiple measures, or cross-validation procedures. Also, almost every suggestion made costs time and money - both for the participants and for the experimenters.

A somewhat different problem stems from the fact that in many cases researchers are presented with true dilemmas (McGrath, 1982). Increasing construct validity through multiple measures may decrease subject motivation and introduce halo
or bias because of the length of the questionnaire. Precision (ask fewer questions) can be forfeited for greater generalizability (a larger sample). Alpha can be increased by asking many similar questions, but this may decrease discriminant validity. In short, as McGrath (1982) has convincingly argued, one simply cannot maximize on all dimensions.

A final problem has to do with the notion of applied versus basic research. As Cook and Campbell (1976) point out, the field researcher often is more interested in simply showing that some variable (e.g., goal setting) correlates with performance than in demonstrating that the goal setting measure really reflects the goal setting construct rather than an evaluation apprehension construct or a construct reflecting the power of social norms. Thus, if it works, no matter what the reason, the result is important from an applied perspective.

The above explanations probably are not exhaustive. Also, more than one may operate in any given study. But even if practical limitations, trade-offs, an applied orientation, and insufficient expertise are operating, certain procedures may help.

Strategies for Doing Better Research

Sample. Three issues concern the sample. First, and most obvious, is the response rate. It should be reported (48 studies failed to do this). Also, if the response rate is low, various follow-up procedures (e.g., a second mailing, a phone call, a reminder memo, or a postcard) should have been arranged already.

A second issue is whether the sample is representative of anything - even the organization from which it was drawn. Most studies (83 percent) failed to obtain random samples, making it difficult to determine the identity of the population represented by the sample. What is almost universally missing from survey studies is any mention of norms for the measures, for the organization, or for the occupation involved in the study, even though normative data are available for many of the most frequently used measures (e.g., the Leader Behavior Description Questionnaire, the Least Preferred Co-worker Scale, the Job Descriptive Index, the Job Diagnostic Survey, and many more). There also may be normative data on the organization as a whole to which the sample can be compared. Biographical data probably are readily available (age, income, education, gender), and some data are gathered by unions and the government on various occupational groups. The point is that gathering these data can be done in most cases by simply having the investigator do a little more work, and these data can be very helpful in understanding what sort of group is being studied.

Much of the above information also could be useful for the third sample issue: making comparisons between respondents and nonrespondents. Some additional thinking before gathering data again can be helpful. Find out the variables for which measures can be obtained for the organization as a whole (e.g., number of people at different job levels, number of people in different divisions or departments) or for every individual within the organization (e.g., the descriptive data mentioned above). Ask these questions on the questionnaire (e.g., job level, age, income) and then compare the sample with the population. When differences exist between the population and the sample or between respondents and nonrespondents, various corrective procedures are available. For example, if the respondents have 10 percent more college educated people than the nonrespondents, 10 percent of the college educated respondents can be randomly dropped. There also are statistical procedures available for making comparisons between sample and population characteristics, as sketched below. If these procedures show that the sample is not significantly different from the population, one can proceed with the other analyses without changing the constitution of the sample or empirically adjusting the data. Also, the researcher can actively think about and measure those variables that might be correlated with the variables being tested for associational effects (as and bs). If, for example, in the hypothetical charisma study one believes that respondents may be older (or younger) than nonrespondents and that age may be related to charisma, one should make sure to measure age. These data can help both the investigator and the reader understand the results.

Finally, a possible strategy for gathering nonrespondent data is to include a postcard with all questionnaires requesting that, if the recipient is not going to fill out the questionnaire, he/she check the items on the postcard and drop it in the mail. A few critical biographical questions or one or two scales can be included. These data, though incomplete, are better than nothing.
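One concrete version of such a comparison is sketched below: a chi-square test of whether respondents and nonrespondents differ in the proportion who are college educated, assuming the counts can be obtained from, say, personnel records; all numbers are invented:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: [college educated, not college educated].
respondents = np.array([70, 30])
nonrespondents = np.array([150, 180])

table = np.vstack([respondents, nonrespondents])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # a small p suggests nonrepresentativeness
```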
Measures. Obviously, more information is desired about the reliability and validity of the measures. At the minimum, as for the sample issues, better reporting is needed. Most studies do cite the source of the measure, but few specifics are given for the type of reliability or validity or its strengths.

In conducting the research, the ideal situation would be to get multiple measures of each construct using different methods and to obtain test-retest reliability for each measure. However, such thoroughness usually is unattainable. But something can be done with limited cost. In most cases test-retest data can be collected on a small sample (e.g., 15 to 20 people), as sketched below. If test-retests are impossible, the questionnaire can be designed so that some items are repeated. In a long questionnaire, in which multiple constructs are measured with multiple items, such a procedure could be used.

Some things can be done to help minimize problems with method variance and construct validity. If multiple constructs are assessed, measures with different formats should be used. If multiple measures of the same construct cannot be used in the questionnaire, perhaps convergent validity data can be obtained in other ways. For example, if a number of people are working at the same job, half of them could fill out one type of measure for describing the job and the other half could use a different measure. Also, just as a small additional sample can help with reliability, it can help with validity. A small group can receive a longer questionnaire with multiple measures of multiple constructs. One could also do some things to reduce other aspects of method variance discussed by Fiske (1982). For example, as and bs might be gathered at different times or with different experimenters or in different rooms. Anything that changes the test or the test context should reduce method variance.
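A sketch of the inexpensive test-retest check mentioned above; the subsample size and data are hypothetical, and an estimate from 20 cases is of course only rough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 20                                               # small retest subsample
true_score = rng.normal(size=n)
time1 = true_score + rng.normal(scale=0.5, size=n)   # first administration
time2 = true_score + rng.normal(scale=0.5, size=n)   # same scale, weeks later

r, p = stats.pearsonr(time1, time2)
print(f"test-retest r = {r:.2f} (n = {n})")
```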

Another suggestion is to split the sample used for reliability or validity purposes. More specifically, if a sample of about 200 people can be measured only once, randomly assign them to 4 groups of 50 each. Each subgroup might receive a different questionnaire composed of a somewhat different combination of as and bs, potential confounds, reliability or validity checks, and so on. Although some commonality should exist for all four groups, this strategy also can generate data on almost all the other aspects of validity addressed here.

Analyses and Inferences. Most investigators seem to be sensitive to problems of multicollinearity because of a lack of independence of constructs, measures, or both. However, little emphasis is placed on convergent or discriminant validity. To demonstrate discriminant validity, the investigator needs to think of and measure variables to which as and bs should be unrelated or related in an opposite direction from the predicted relationships for as and bs. It does not take many of these to provide such a demonstration, but again it does require that these factors be thought about before the data are gathered.

Finally, to increase confidence in both the stability of the finding and its generalizability, more holdout samples and cross-validation are necessary. In many studies the sample clearly is large enough for holdout samples to be used - but this is infrequently done. If the sample is small, a replication with another sample would be desirable.

In summary, many things can be done to improve the current state of affairs, and most of the above strategies are within the investigator's control - they simply require more work or more preliminary thinking. However, in some cases some of these strategies simply may not be possible. What may be done in these circumstances?

Problem Settings

All research, no matter what type, is a series of decisions, and for many of these choices there is no "right" answer (recall the discussion of dilemmas). When the investigator is confronted with practical problems that limit his or her ability to conform to the checklist, it seems reasonable to expect, at a minimum, a report on the reasoning behind such decisions. That is, more information about the research context and its limitations would be helpful.
A second response when one confronts a setting that limits the ability to use the correlational methodology is the use of a different methodology. The December 1979 issue of Administrative Science Quarterly presented a series of papers discussing alternative research strategies. Included were calls for more participant observation, ethnographies, case studies, the use of diaries, unobtrusive measures, and so on (Van Maanen, Dabbs, & Faulkner, 1982). The use of new and different techniques certainly should be encouraged. But, again, one should not expect that any of these techniques permit maximization on the three dimensions of generalizability, precision, and realism (McGrath, 1982). Every one of these "innovative" techniques has its own shortcomings.

A third response presents a very different perspective. Researchers who have developed and presented various validity generalization models (Callender & Osburn, 1980; Schmidt & Hunter, 1977) emphasize the importance of having a large pool of studies on which meta-analysis can be performed. Validity generalization models allow an assessment of the impact of, and correction for, the artifacts of error of measurement (unreliability), restriction in range, and sampling error (unsystematic error in single studies). Validity generalization studies have demonstrated that such sampling error is the major source of differences across studies and that other types of errors may be relatively unimportant in estimating validities (e.g., race of respondent). Thus, if multiple studies are available, meta-analysis may help correct for certain types of design flaws.
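A bare-bones sketch of the sampling-error logic behind these models, in the spirit of Hunter, Schmidt, and Jackson (1982); the correlations and sample sizes are invented, and real applications also correct for unreliability and range restriction:

```python
import numpy as np

# Hypothetical correlations and sample sizes from k = 6 studies.
r = np.array([0.10, 0.25, 0.18, 0.30, 0.05, 0.22])
n = np.array([80, 120, 60, 200, 50, 150])

r_bar = np.sum(n * r) / np.sum(n)                   # sample-size-weighted mean r
var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)  # observed variance of r
var_err = (1 - r_bar ** 2) ** 2 / (n.mean() - 1)    # expected sampling-error variance
var_rho = max(var_obs - var_err, 0.0)               # residual "true" variance

print(f"mean r = {r_bar:.3f}, observed var = {var_obs:.4f}")
print(f"sampling-error var = {var_err:.4f}, residual var = {var_rho:.4f}")
```

When the residual variance is near zero, sampling error alone can account for the apparent differences across studies, which is the pattern the validity generalization literature reports.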

It is clear that much published work does not conform to some fairly standard guidelines for doing acceptable cross-sectional correlational research. Correlational research of the variety described can be and is a powerful research tool. The purpose of this paper is not to compare the strengths and weaknesses of cross-sectional correlational studies with those of other strategies or to suggest how various research strategies can be combined - such as the use of longitudinal surveys allowing cross-lag analyses or highly articulated theories allowing for path analysis. These are topics for a different paper. Rather, it is hoped that the checklist and suggestions presented in this paper will help investigators deal with the complexities and practical problems inherent in the use of the correlational methodology. More thorough reporting, innovative measurement, and preresearch planning are needed. And the use of meta-analysis, where feasible, can add to knowledge in the area. Correlational methodology is a valuable research tool. With the recognition of where and how it can be done better, it will continue to serve an important function in organizational research.

References
Berkowitz, L., & Donnerstein, E. External validity is more than skin deep: Some answers to criticisms of laboratory experiments. American Psychologist, 1982, 37, 245-257.

Brinberg, D., & McGrath, J. E. A network of validity concepts within the research process. In D. Brinberg & L. H. Kidder (Eds.), Forms of validity in research. San Francisco: Jossey-Bass, 1982, 5-22.

Callender, J. C., & Osburn, H. G. Development and test of a new model for validity generalization. Journal of Applied Psychology, 1980, 65, 543-558.

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963, 171-246.

Campbell, J. P., Daft, R. L., & Hulin, C. L. What to study: Generating and developing research questions. Beverly Hills, CA: Sage, 1982.

Cook, T. D., & Campbell, D. T. The design and conduct of quasi-experiments and true experiments in field settings. In M. Dunnette (Ed.), Handbook of industrial and organizational psychology. Skokie, IL: Rand McNally, 1976, 223-326.

Cook, T. D., & Campbell, D. T. Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally, 1979.

Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.

Cronbach, L. J. The two disciplines of scientific psychology. American Psychologist, 1957, 12, 671-684.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.

Cronbach, L. J., & Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955, 52, 281-302.

Dipboye, R. L., & Flanagan, M. F. Research settings in industrial and organizational psychology: Are findings in the field more generalizable than in the laboratory? American Psychologist, 1979, 34, 141-150.

Dipboye, R. L., & Flanagan, M. F. Reply to Willems and Howard. American Psychologist, 1980, 35, 388-390.

Drasgow, F., & Miller, H. E. Psychometric and substantive issues in scale construction and validation. Journal of Applied Psychology, 1982, 67, 268-279.

Fiske, D. W. Convergent-discriminant validation in measurements and research strategies. In D. Brinberg & L. H. Kidder (Eds.), Forms of validity in research. San Francisco: Jossey-Bass, 1982, 77-92.

Guion, R. M., & Cranny, C. J. A note on concurrent and predictive validity designs: A critical reanalysis. Journal of Applied Psychology, 1982, 67, 239-244.

House, R. J. A theory of charismatic leadership. In J. G. Hunt & L. L. Larson (Eds.), Leadership: The cutting edge. Carbondale, IL: Southern Illinois University Press, 1977, 189-207.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage, 1982.

James, L. R., Mulaik, S. A., & Brett, J. M. Causal analysis: Assumptions, models, and data. Beverly Hills, CA: Sage, 1982.

Maher, B. A. A reader's, writer's, and reviewer's guide to assessing research reports in clinical psychology. Journal of Consulting and Clinical Psychology, 1978, 46, 835-838.

McGrath, J. E. Dilemmatics: The study of research choices and dilemmas. In J. E. McGrath, J. Martin, & R. A. Kulka (Eds.), Judgment calls in research. Beverly Hills, CA: Sage, 1982.

Schmidt, F. L., & Hunter, J. E. Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 1977, 62, 529-540.

Schwab, D. P. Construct validity in organizational behavior. In L. L. Cummings & B. Staw (Eds.), Research in organizational behavior (Vol. 2). Greenwich, CT: JAI Press, 1980, 3-43.

Van Maanen, J., Dabbs, J. M., & Faulkner, R. R. Varieties of qualitative research. Beverly Hills, CA: Sage, 1982.

Willems, E. P., & Howard, G. S. The external validity of papers on external validity. American Psychologist, 1980, 35, 387-388.

Terence R. Mitchell is a Professor of Management and Organization and of Psychology at the University of Washington, Seattle, Washington.

Author's note: The author would like to thank Louis Fry, Gary Latham, and Sandra Mitchell for their comments on an earlier version of this paper. Also, Karen Brown and Robert Pizzi helped with the collection and analysis of the data.

