Three types of attitude scales:
  Guttman
  Thurstone
  Likert

A Likert scale (/ˈlɪk.ərt/ LIK-ərt, though more commonly pronounced /ˈlaɪ.kərt/ LY-kərt) is a psychometric rating scale widely used in questionnaire research.
Below is a sample 50-item Big Five questionnaire taken from the web site of the International Personality
Item Pool (IPIP) (http://ipip.ori.org/ipip/).
The 5 constructs measured by Big Five questionnaires are often called domains.
The items on the web site have been modified so that each is a complete sentence.
For example, item 1 on the web site is “Am the life of the party.” Here it is “I am the life of the party.”
Even-numbered items have been shaded. I have no evidence that such shading is beneficial.
The IPIP web site recommends a 5-point response scale. I prefer a 7-point response scale.
If you need a 50-item Big Five questionnaire, you may copy and use what follows.
E: 1,6,11,16,21,26,31,36,41,46
A: 2,7,12,17,22,27,32,37,42,47
C: 3,8,13,18,23,28,33,38,43,48
S: 4,9,14,19,24,29,34,39,44,49
O: 5,10,15,20,25,30,35,40,45,50
Note the periodicity in the placement of items – every 5th item is from the same domain.
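A sketch of SPSS scoring syntax for this key. The item variable names q1 to q50 are assumptions, and negatively keyed items (which the key above does not identify) would need to be reverse-scored first:

* Score each domain as the mean of its ten items (names q1-q50 assumed).
COMPUTE E = MEAN(q1, q6, q11, q16, q21, q26, q31, q36, q41, q46).
COMPUTE A = MEAN(q2, q7, q12, q17, q22, q27, q32, q37, q42, q47).
COMPUTE C = MEAN(q3, q8, q13, q18, q23, q28, q33, q38, q43, q48).
COMPUTE S = MEAN(q4, q9, q14, q19, q24, q29, q34, q39, q44, q49).
COMPUTE O = MEAN(q5, q10, q15, q20, q25, q30, q35, q40, q45, q50).
EXECUTE.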
7 = Completely Accurate
6 = Very Accurate
5 = Probably Accurate
4 = Sometimes Accurate, Sometimes Inaccurate
3 = Probably Inaccurate
2 = Very Inaccurate
1 = Completely Inaccurate
The items in this scale are presented as questions. In other instances, they are presented as statements.
If presented as statements, the responses would represent amount of agreement.
If you need an overall job satisfaction scale, you may use this.
For each statement please put a check ( ) in the space showing how you feel about the following aspects
of your job. This time, indicate how satisfied you are with the following things about your job.
VD   MD   SD   N   SS   MS   VS
 1    2    3   4    5    6    7
(VD = Very Dissatisfied, MD = Moderately Dissatisfied, SD = Slightly Dissatisfied, N = Neutral, SS = Slightly Satisfied, MS = Moderately Satisfied, VS = Very Satisfied)

Overall Satisfaction
Consider a single item in a scale, "I am satisfied with my job." Now consider the true positions of
several respondents, represented by the positions of the top arrows:

[Figure: arrows marking respondents' true positions above the VD-to-VS (1-7) continuum; all but the rightmost fall nearest MS, with the rightmost just across the boundary toward VS.]
The response labels put there by the test constructor represent points on a continuum.
They're like 1-foot marks on a scale of height.
So, in the above situation each respondent except the rightmost one would respond MS, which would be
assigned the value 6. But the rightmost respondent, whose true position on the dimension is close to
his/her nearest neighbor's, would pick VS, creating a considerable "response" distance from that neighbor.
Since each respondent can pick only one of the response CATEGORIES, any response made may miss
the respondent's true amount of satisfaction by up to half a category width: about 7 percent of the
continuum on a 7-point scale, about 10 percent on a 5-point scale.
Note the wide range of actual feelings which would be represented by a 6 above.
Consider that two persons very close in their actual feelings about the job could get scores a full
response category apart. E.g., a person whose actual feeling is 6.55 would check 7, but a person whose
actual feeling was 6.45 would check 6. The difference of 1 in scores would be much greater than the
difference of .1 in actual feeling.

[Figure: red arrows at true positions 6.45 and 6.55, mapping to responses 6 and 7.]
This situation is analogous to one that most students have strong feelings about – the use of 5 grades to
represent performance in a course. We all remember those instances in which we missed the next higher
grade mark by a 10th of a point. The use of a single item with just a few response categories is
analogous.
Solution: Use multiple items. While each one may miss its mark considerably, some of the misses will
be positive and some will be negative, so they tend to cancel each other out, and the average of the
responses will be very close to the respondent's true position on the continuum.
Conclusion: Having multiple items and averaging responses to the multiple items increases
accuracy of identification of the respondent’s true position on a dimension.
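A way to see why (a standard result, not part of the original notes): suppose each response equals the true position plus an independent error,

$$x_i = T + e_i, \qquad \mathrm{Var}(e_i) = \sigma^2 \;\Rightarrow\; \bar{x} = T + \frac{1}{k}\sum_{i=1}^{k} e_i, \qquad \mathrm{Var}(\bar{x} - T) = \frac{\sigma^2}{k}$$

so averaging k items shrinks the expected miss by a factor of the square root of k.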
Since a single categorical item response is only a gross approximation to the actual (true) feeling, a
person might get a very different score (6 vs. 7, for example) on repeated measurement with a single
item. This reduces reliability. Reduced reliability reduces estimated validity, and reduced estimated
validity reduces your chances of getting published.
It is possible to assess the reliability of a multiple-item scale in a single administration of the scale to a
group by computing coefficient alpha. That is not possible with a single-item scale.
Conclusion: Using multiple items and basing the scale score on the sum or mean greatly facilitates
our ability to estimate the reliability of the scale score.
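For reference (the standard formula, not shown in the original notes), coefficient alpha for a scale of k items is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_i}{\sigma^2_X}\right)$$

where the sigma-squared-i are the item variances and sigma-squared-X is the variance of the total score.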
Sometimes, a respondent will have a unique reaction to the wording of a single item. This reaction may
be based on the respondent’s history or understanding of that item. If that item is the only item in the
scale, then the respondent’s position on the dimension will be greatly distorted by that reaction.
Conclusion: Including multiple items and using the sum or mean of responses to them diminishes
the influence of any one idiosyncratic item.
1. Test length.
2. How many response categories: 5 or 7? Some data:

5-point response scales                 7-point response scales
Nhung faking study: r = -.43            Incentive Faking study: r = +.08
Vikus study: r = -.49                   FOR Study Gen: r = -.04
Bias Study IPIP: r = -.32               FOR Study FOR: r = +.08
Bias Study NEO-FFI: r = -.34            Worthy Thesis: r = -.18
3. Should there be a neutral middle response category?
I am not familiar with a clear-cut, strong argument either way. I prefer to include one.
If you analyze the data using Confirmatory Factor Analysis or Structural Equation Modeling, it doesn’t
matter.
My guess (and it’s just a guess) is that you’ll get a few more failures to respond without one, from
people who just can’t make up their minds.
And variability of responses might be slightly smaller with one, from those same people responding in
the middle.
But, I’m not aware of a meta-analysis on this issue.
4. What numeric values should be assigned to each response possibility for analyses based on
sums or means?
Although at one time there were arguments for scaling the various response alternatives, now almost
everyone who analyzes the data traditionally uses successive, equally spaced integers. The integers need
not be successive, but in practice everyone uses successive ones (as opposed to, say, every other integer).
For example

Strongly                                         Strongly
Disagree    Disagree    Neutral    Agree    Agree
    1           2          3         4        5

Or

Strongly                                         Strongly
Disagree    Disagree    Neutral    Agree    Agree
   -2          -1          0        +1       +2
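In SPSS, the verbal labels are attached to the assigned integers with value labels; a minimal sketch, assuming an item named q1 scored 1 to 5:

VALUE LABELS q1
  1 'Strongly Disagree'
  2 'Disagree'
  3 'Neutral'
  4 'Agree'
  5 'Strongly Agree'.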
Newer Confirmatory Factor Analysis and Structural Equation Modeling based analyses, which treat the
data as "Ordered Categorical," require simply that the response categories be ordered. No numeric
assignment is required.
5. If the analyses are based on sums or means, which integers should be used?
Answer: Any set of successive integers will do.
1 to 5 or 1 to 7
0 to 4 or 0 to 6
-2 to +2 or -3 to +3.
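Moving between such sets is just a linear shift, which leaves correlations and reliability unchanged. A one-line SPSS sketch, assuming an item q1 scored 1 to 5:

* Shift a 1-to-5 scoring to -2-to-+2.
COMPUTE q1c = q1 - 3.
EXECUTE.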
Yes, the God of statistics will strike you down if you make small numbers indicate more of ANY
construct. Being a golfer will not save you.
I strongly prefer assigning numbers so that a bigger response value represents more of the construct as
it is named. I’m sure it’s what the God of Statistics intended.
Negatively worded items may be included, although there is no guarantee that responses to negatively
worded items will be the actual negation of what the responses to a positively worded counterpart would
have been.
Responses to a positively worded item and its negatively worded counterpart should be perfectly
negatively correlated, but often they are not.
Many studies have found that negatively worded items are responded to similarly to other negatively
worded items, regardless of content or dimension, presumably just because of the negative wording.
We have found this in seven datasets.
We've also found that positively worded items are responded to similarly regardless of content, just
because they're positively worded.
Recommendation:
Best: Design your questionnaire, and analyze it using factor analysis, so that it permits
estimation of the bias tendencies. Estimate a general factor, a positively-worded item factor,
and a negatively-worded item factor. Treat these three factors as separate indicators of the
construct. Nobody does this now, because such wording-related response tendencies have only
recently been discovered and are still being investigated.
Typically, negatively worded items are reverse-scored and then they’re treated as if they had been
positively worded.
Original Reversed
1 5
2 4
3 3
4 2
5 1
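In general, reversed = (minimum + maximum) - original, so 6 - original for a 1-to-5 scale. A one-line SPSS sketch, assuming an item named q3:

* Reverse-score a 1-to-5 item.
COMPUTE q3r = 6 - q3.
EXECUTE.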
If there are no missing values, the sum and the mean will be perfectly correlated – they’re
mathematically equivalent, so you can use either.
The mean is more easily related to the questionnaire items if they all have the same response format.
If there are missing values, use the mean of the available items, or use the imputation techniques
described below to fill in the missing values, after which it won't matter whether you use the mean or the sum.
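A sketch of the mean-of-available-items approach in SPSS, assuming items q1 through q10 are adjacent in the data file; MEAN.8 returns the mean only when at least 8 of the 10 responses are valid:

* Scale score as the mean of available items, requiring at least 8 valid.
COMPUTE scale_score = MEAN.8(q1 TO q10).
EXECUTE.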
e. The conventional wisdom is changing on issues of missing values. Many modern statistical techniques
are designed to work with all available data. These techniques do not include REGRESSION and GLM,
which by default drop cases with any missing values.
11. Writing the items. Spector, p. 23...
a. Each item should involve only one idea.
   E.g., "The death penalty should be abolished because it's against religious law." involves two.
b. Avoid colloquialisms and jargon.
   "I am the life of the party." "I shirk my duties."
c. Consider the reading level of the respondent.
d. Avoid using "not" to create negatively worded items.
   Good: Communication in my organization is poor.
   Bad: Communication in my organization is not good.
e. Avoid items that might trigger emotional responses in certain samples.
Case   Q1   Q2   Q3   Q4
1 2 2 2 2
2 2 2 3 3
3 2 2 4 4
4 2 2 5 5
5 3 3 2 2
6 3 3 3 3
7 3 3 4 4
8 3 3 5 5
9 4 4 2 2
10 4 4 3 3
11 4 4 4 4
12 4 4 5 5
13 5 5 2 2
14 5 5 3 3
15 5 5 4 4
16 5 5 5 5
For these hypothetical data, Q1 and Q2 are perfectly correlated, as are Q3 and Q4. Obviously, items
within the same scale are not perfectly correlated in real life.
But Q1+Q2 is uncorrelated with Q3+Q4: the two constructs are independent.
* Syntax to create construct scale scores.
compute C1=mean(Q1,Q2).
compute C2=mean(Q3,Q4).
correlate c1 with c2.

Correlation of C1 with C2: Pearson r = .000, Sig. (2-tailed) = 1.000, N = 16.
But when between-person differences in response tendencies are added to the data, the correlation
between Q1+Q2 and Q3+Q4 becomes .555, a value that is statistically significant.
The point of this is that differences in participants' response tendencies (e.g., the tendency of some to
use only the upper part of a response scale while others use the lower part) can result in positive
correlations between constructs that are, in fact, uncorrelated.
This problem has been referred to as the method bias problem. The term refers to the fact that
correlations between constructs obtained using the same method are biased upward. It plagues the use
of summated rating scales. Many journals will not accept research in which the independent and the
dependent variables are measured using the same method.
2. See if someone else has already created a scale measuring that construct. If so, and if it appears OK,
don’t re-invent the wheel. Faculty. Buros Institute. IPIP web site. Google.
http://buros.org/mental-measurements-yearbook
Remember . . .
4. Have a sample of SMEs rate the extent to which each item represents the construct. Keep only the
best.
a. Assess reliability.
b. Identify bad items, those that reduce reliability, and eliminate them.
c. Assess dimensionality using exploratory factor analysis.
All items in the same scale should represent the same dimension and no other dimension.
7. Perform a validation study assessing convergent and discriminant validity using the population of
interest, perhaps using the pilot sample.
a. Administer other similar scales.
b. Administer other, conceptually distinct scales (for discriminant validity).
Kayitesi Wilt's thesis was a validation study of the Cultural Intelligence Scale (CQS).
She compared the mean CQS scores of persons who've been abroad and enjoyed that travel vs.
those who've been abroad and not enjoyed it. That's evidence of convergent validity.
She also administered the CQS along with an Emotional Intelligence Scale, a Social Intelligence Scale,
and a Big Five questionnaire, and assessed the discriminant validity of the CQS with respect to the
other scales: it should not be highly correlated with any of them. That's discriminant validity.
8. Administer to a sample from the population of interest along with the other scales that are part of
your research project.
This example is taken from an independent study project conducted by Lyndsay Wrensen examining
factors related to faking of the Big 5 personality inventory. She administered the IPIP Big 5
inventory twice – once under instructions to respond honestly and again (counterbalanced) under
instructions to respond as if seeking a customer service job.
The data here are the honest condition responses to the Extroversion scale. Participants read each item
and indicate how accurately it described them using 1=Very inaccurate to 5=Very accurate. Some of the
items were negatively worded. We now would use a 7-point response scale. This project was done
almost 10 years ago.
* Reverse-score the negatively worded items into new variables.
recode he2 he4 he6 he8 he10 (1=5)(2=4)(3=3)(4=2)(5=1) into he2r he4r he6r he8r he10r.
execute.
However you do it, put the reverse-scored values in columns that are different from the originals.
For example, use SPSS’s imputation features. Set up a time with me, and I’ll walk you through the
process.
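A minimal sketch of multiple imputation using SPSS's Missing Values module; the dataset name and variable list are assumptions:

DATASET DECLARE imputed.
MULTIPLE IMPUTATION he1 he2 he3 he4 he5 he6 he7 he8 he9 he10
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5
  /OUTFILE IMPUTATIONS=imputed.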
Reliability

Case Processing Summary
Cases: Valid 179 (90.4%), Excluded(a) 19 (9.6%), Total 198 (100.0%)
a. Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha: .859
Cronbach's Alpha Based on Standardized Items: .860
N of Items: 10

Item Statistics
          Mean   Std. Deviation     N
he1       3.13        1.122       179
he3       3.97         .908       179
he5       3.72        1.093       179
he7       3.34        1.277       179
he9       3.41        1.216       179
he2r      3.56        1.254       179
he4r      3.27        1.136       179
he6r      3.79        1.110       179
he8r      2.74        1.224       179
he10r     2.70        1.285       179

Inter-Item Correlation Matrix
         he1    he3    he5    he7    he9    he2r   he4r   he6r   he8r   he10r
he1     1.000   .334   .245   .518   .542   .367   .435   .211   .336   .296
he3      .334  1.000   .473   .372   .331   .354   .367   .368   .170   .383
he5      .245   .473  1.000   .553   .329   .421   .407   .488   .242   .519
he7      .518   .372   .553  1.000   .427   .391   .404   .446   .198   .375
he9      .542   .331   .329   .427  1.000   .382   .496   .258   .461   .295
he2r     .367   .354   .421   .391   .382  1.000   .550   .572   .331   .448
he4r     .435   .367   .407   .404   .496   .550  1.000   .375   .371   .450
he6r     .211   .368   .488   .446   .258   .572   .375  1.000   .241   .417
he8r     .336   .170   .242   .198   .461   .331   .371   .241  1.000   .210
he10r    .296   .383   .519   .375   .295   .448   .450   .417   .210  1.000

Summary Item Statistics
                           Mean   Minimum   Maximum   Range   Maximum/Minimum   Variance   N of Items
Item Means                3.363     2.698     3.972   1.274             1.472       .180           10
Item Variances            1.363      .825     1.650    .825             2.000       .065           10
Inter-Item Correlations    .381      .170      .572    .402             3.363       .010           10

Item-Total Statistics
         Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
         Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
he1          30.50            50.162               .547               .456                .848
he3          29.66            52.496               .516               .312                .851
he5          29.92            49.504               .612               .504                .842
he7          30.29            47.724               .610               .501                .842
he9          30.22            48.691               .586               .449                .844
he2r         30.07            47.501               .638               .489                .839
he4r         30.36            48.546               .649               .455                .839
he6r         29.84            50.080               .560               .443                .847
he8r         30.89            51.309               .417               .273                .859
he10r        30.93            48.501               .557               .375                .847
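For reference, a sketch of SPSS syntax that would produce output like the above; the scale label and the particular subcommand choices are assumptions:

RELIABILITY
  /VARIABLES=he1 he3 he5 he7 he9 he2r he4r he6r he8r he10r
  /SCALE('Extroversion - honest condition') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE CORR
  /SUMMARY=TOTAL MEANS VARIANCE CORR.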
Factor scores are computed by differentially weighting each item according to its contribution to the
indication of the dimension. Items which are not highly correlated with the dimension are given little
weight; those which are highly correlated with the dimension are given more weight.
Note that summated scale scores are computed by equally weighting each item that is thought to
be relevant. So a summated score is a crude factor score.
The loadings of the items on the factor are used to determine the weights.
Advantages: Factor scores probably better capture the dimension of interest; they're probably more
highly correlated with the dimension than the simple sum of the items. They can also be computed
taking into account other factors that might influence the items, and thus may be uncontaminated
by those other factors.
Disadvantage: The weights will differ from sample to sample, so a weighting scheme based on your
sample will differ from a weighting scheme based on my sample.
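A sketch of SPSS syntax for saving regression-based factor scores, using the Extroversion items from the example above; the one-factor specification is an assumption:

FACTOR
  /VARIABLES=he1 he3 he5 he7 he9 he2r he4r he6r he8r he10r
  /MISSING=LISTWISE
  /CRITERIA=FACTORS(1)
  /EXTRACTION=PAF
  /ROTATION=NOROTATE
  /SAVE=REG(ALL).

This appends a new variable (FAC1_1 by default) holding each person's factor score.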
Item Response Theory (IRT) is a statistical theory of how people respond to items and of how to score
the items. It is kind of like factor analysis, but the underlying theory is different.
IRT methods are used by most large-scale test publishers, such as ETS, ACT, Pearson. IRT methods
routinely incorporate ideas that are not usually considered by persons using summated scales.
If you’re serious about measurement, you’ll have to learn a lot about both factor analytic methods and
IRT methods.
Virtually all scales are scored to represent the level of responses to items representing a dimension.
So, a Conscientiousness score is the average level, the mean, of a person's responses to the
Conscientiousness items in a questionnaire.
We've been exploring inconsistency of responding, measured as the standard deviation of a person's
responses to items from the same dimension.
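A sketch of how such an inconsistency score could be computed in SPSS, using the Extroversion item names from the earlier example:

* Within-person inconsistency: the SD of one person's responses to the
* items of a single dimension.
COMPUTE incons = SD(he1, he3, he5, he7, he9, he2r, he4r, he6r, he8r, he10r).
EXECUTE.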
Overall UGPA was the criterion. Conscientiousness and Variability (Inconsistency) were predictors.

[Path diagram: Conscientiousness and Inconsistency predicting UGPA; multiple R = .315; the path from Inconsistency to UGPA is -.219.]
These data suggest that Inconsistency of Responding may be a valid predictor of certain types of
performance.
References
Reddock, C. M., Biderman, M. D., & Nguyen, N. T. (2011). The relationship of reliability and validity
of personality tests to frame-of-reference instructions and within-person inconsistency.
International Journal of Selection and Assessment, 19, 119-131.
For example: If you give a Big 5 questionnaire to a group of respondents, you can measure the
following 11 attributes:
Extraversion, Agreeableness, Conscientiousness, Stability, Openness
General Affect, Positive Wording Bias, Negative Wording Bias
Inconsistency, Extreme Response Tendency, Acquiescent Response Tendency