
Reliability and Item Analysis


o precise measurement of hypothesized processes or variables
o construction of reliable measurement scales
o precision of measurement (applied research) whenever variables are
difficult to observe (e.g. employee performance)
o design and evaluation of sum scales (made up of multiple individual
measurements)
Basic Ideas
Questionnaire to measure people's prejudices against foreign-made cars
Example: (Based on the slogan "Real Americans buy American cars!")
Items
1. Foreign-made cars lack personality 1 2 3 4 9
2. Foreign-made cars look the same 1 2 3 4 9
1 = disagree 9 = agree

True Scores and Error
Example: Foreign-made Cars
Two Aspects in the Response
True score: the respondent's prejudice. Error: other, esoteric aspects
(e.g. a friend has just bought a foreign-made car)
Classical Model
X = tau + error
where X is the actual measurement (the subject's response to the item),
tau is the true score (prejudice), and error is the random error (esoteric aspects)

Reliability
A measurement is reliable if it reflects mostly the true score, relative to the error
Ex: The item "Red foreign-made cars are particularly ugly" is unreliable. Why?
- it captures not only a person's prejudice but also his or her color
preference,
so the proportion of true score in the response would be small

Measure of Reliability
\[ \text{Index of Reliability} = \frac{\sigma^2_{\text{true score}}}{\sigma^2_{\text{total observed}}} \]
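The classical model and the index of reliability can be illustrated with a small simulation. This is only a sketch under assumed conditions: true scores and errors are drawn from normal distributions, and the function name, sample size, and standard deviations are hypothetical choices, not part of the original notes.

```python
import random
import statistics

def simulate_reliability(n_subjects=5000, true_sd=1.0, error_sd=1.0, seed=0):
    """Simulate X = tau + error and return var(tau) / var(X)."""
    rng = random.Random(seed)
    # True scores (tau) and observed scores (X) for each simulated subject.
    tau = [rng.gauss(0.0, true_sd) for _ in range(n_subjects)]
    x = [t + rng.gauss(0.0, error_sd) for t in tau]
    return statistics.pvariance(tau) / statistics.pvariance(x)

# Equal true-score and error variance: the ratio should be near 0.5.
print(round(simulate_reliability(), 2))
# No error at all: the observed score equals the true score.
print(simulate_reliability(error_sd=0.0))
```

Increasing error_sd relative to true_sd pushes the ratio toward 0, which is the situation of the "red foreign-made cars" item.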



Sum Scales
o sum of several (reliable) items
o Expected value of the error: E(error) = 0 (what does this mean?)
o The more items, the more reliable the sum scale
Ex: measuring the height of ten persons with a meter stick
Measuring each person only once is not reliable
Measure each person 100 times and take the average: you will
be able to distinguish reliably between individuals in
terms of their height
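The height example can be sketched as a simulation. The ten true heights, the error standard deviation of 10 cm, and the number of trials below are all hypothetical values chosen for illustration. Because E(error) = 0, the errors tend to cancel in the average, so the averaged readings track the true heights more closely.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def averaged_readings(true_heights, n_readings, error_sd, rng):
    """Average n_readings noisy measurements (true height + error) per person."""
    return [
        statistics.fmean([h + rng.gauss(0.0, error_sd) for _ in range(n_readings)])
        for h in true_heights
    ]

def mean_correlation(true_heights, n_readings, error_sd=10.0, trials=200, seed=1):
    """Average correlation between true and measured heights over many trials."""
    rng = random.Random(seed)
    correlations = [
        pearson(true_heights, averaged_readings(true_heights, n_readings, error_sd, rng))
        for _ in range(trials)
    ]
    return statistics.fmean(correlations)

# Hypothetical true heights (cm) of ten persons.
heights = [150, 155, 160, 165, 168, 170, 172, 175, 180, 190]
r_once = mean_correlation(heights, n_readings=1)
r_hundred = mean_correlation(heights, n_readings=100)
print(round(r_once, 2), round(r_hundred, 2))
```

The second correlation should be clearly higher than the first: averaging 100 readings shrinks the error standard deviation by a factor of ten.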

Cronbach's Alpha
Several responses to the items enable one to compute
o Variance for each item
o Variance for the sum scale
Theory = Height

Ex:
Respondent 1   Respondent 2   Respondent 3   Respondent n
Item 1         Item 1         Item 1         Item 1
Item 2         Item 2         Item 2         Item 2
Item 3         Item 3         Item 3         Item 3
\[ \alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} S_i^2}{S_{sum}^2} \right) \]

where
$S_i^2$ = variance of item i (one for each of the k individual items)
$S_{sum}^2$ = variance for the sum of all items

o If there is no true score but only random errors in the items (uncorrelated
across items) then $\sum S_i^2 = S_{sum}^2$ and $\alpha = 0$
o If all items are perfectly reliable and measure the same thing (true score) then $\alpha = 1$
o Nunnally (1978) suggests $\alpha > 0.7$
o For binary items (e.g. yes/no) this is called the Kuder-Richardson-20 formula (KR-20)
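The alpha formula translates directly into a few lines of code. The sketch below assumes the data are a list of respondents, each holding k item scores. The function name and the two tiny data sets are hypothetical, chosen to reproduce the two limiting cases (alpha = 1 and alpha = 0).

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    items = list(zip(*scores))  # regroup the scores item by item
    item_variances = [statistics.variance(item) for item in items]
    sum_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / sum_variance)

# All items measure exactly the same thing: alpha is 1 (up to rounding).
identical = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [5, 5, 5]]
print(round(cronbach_alpha(identical), 6))

# Two uncorrelated items: the sum of the item variances equals the variance
# of the sum, so alpha is 0 (up to rounding).
uncorrelated = [[1, 1], [1, 2], [2, 1], [2, 2]]
print(round(abs(cronbach_alpha(uncorrelated)), 6))
```

Sample variances are used here. Population variances would give the same alpha, since only the ratio of variances enters the formula.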



Split-half
o Divide the sum scale into two halves randomly
o Perfectly reliable if the two halves are perfectly correlated (r = 1)

\[ r_{sb} = \frac{2 r_{xy}}{1 + r_{xy}} \]
where $r_{xy}$ = correlation between the two halves
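A minimal sketch of the split-half procedure, assuming the same respondents-by-items layout as before. The function names and the seed-based random split are illustrative choices, not a prescribed method.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_xy):
    """Step the correlation between two halves up to the full-length scale."""
    return 2 * r_xy / (1 + r_xy)

def split_half_reliability(scores, seed=0):
    """Randomly split the items into two halves and correct with Spearman-Brown."""
    k = len(scores[0])
    order = list(range(k))
    random.Random(seed).shuffle(order)
    first, second = order[: k // 2], order[k // 2:]
    x = [sum(row[i] for i in first) for row in scores]
    y = [sum(row[i] for i in second) for row in scores]
    return spearman_brown(pearson(x, y))

print(round(spearman_brown(0.5), 3))  # 2 * 0.5 / 1.5
```

Different random splits generally give somewhat different coefficients, which is one reason Cronbach's alpha is often reported alongside or instead of a single split-half value.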

Designing a Reliable Scale
Step 1. Generating Items.
o Write as many items as possible (essentially a creative process!)
Ex: Ask a small group of highly committed car buyers to express
their general thoughts and feelings about foreign-made cars.
Step 2. Choosing items of optimum difficulty.
o Items with which most respondents agree or disagree do not help to
discriminate between respondents (useless)
o Known as item difficulty
o Look at item means and standard deviations and eliminate those that show
extreme means, and zero or nearly zero variances
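The screening in Step 2 might be sketched as follows. The cut-offs (a mean within one point of either end of a 1 to 9 scale, or a standard deviation below 0.5) are arbitrary assumptions for illustration. The notes do not prescribe specific thresholds.

```python
import statistics

def flag_poor_items(scores, scale_min=1, scale_max=9, margin=1.0, min_sd=0.5):
    """Return the 0-based indices of items with extreme means or near-zero spread.

    The margin and min_sd thresholds are hypothetical, not fixed rules.
    """
    flagged = []
    for index, item in enumerate(zip(*scores)):
        mean = statistics.fmean(item)
        sd = statistics.stdev(item)
        extreme_mean = mean < scale_min + margin or mean > scale_max - margin
        if extreme_mean or sd < min_sd:
            flagged.append(index)
    return flagged

# Hypothetical responses of four respondents to three items.
responses = [[9, 5, 1], [9, 3, 1], [9, 7, 2], [8, 4, 1]]
print(flag_poor_items(responses))  # the first and third items are flagged
```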
Step 3. Choosing internally consistent items (Cronbach's alpha).
o More true score, fewer esoteric aspects (random errors)
o Check for items that have a small correlation with the sum scale, a higher alpha when the
item is deleted, and a small squared multiple correlation (Statistica)
o See also other examples using SPSS and SAS (check our website!)


Example:

STATISTICA RELIABL. ANALYSIS
Summary for scale: Mean=46.1100 Std.Dv.=8.26444 Valid n:100
Cronbach alpha: .794313 Standardized alpha: .800491
Average inter-item corr.: .297818

variable   Mean if    Var. if    StDv. if   Itm-Totl   Squared    Alpha if
           deleted    deleted    deleted    Correl.    Multp. R   deleted
ITEM1      41.61000   51.93790   7.206795   .656298    .507160    .752243
ITEM2      41.37000   53.79310   7.334378   .666111    .533015    .754692
ITEM3      41.41000   54.86190   7.406882   .549226    .363895    .766778
ITEM4      41.63000   56.57310   7.521509   .470852    .305573    .776015
ITEM5      41.52000   64.16961   8.010593   .054609    .057399    .824907
ITEM6      41.56000   62.68640   7.917474   .118561    .045653    .817907
ITEM7      41.46000   54.02840   7.350401   .587637    .443563    .762033
ITEM8      41.33000   53.32110   7.302130   .609204    .446298    .758992
ITEM9      41.44000   55.06640   7.420674   .502529    .328149    .772013
ITEM10     41.66000   53.78440   7.333785   .572875    .410561    .763314


Shown above are the results for 10 items. Of most interest to us are the three
right-most columns. They show us the correlation between the respective item
and the total sum score (without the respective item), the squared multiple
correlation between the respective item and all others, and the internal
consistency of the scale (coefficient alpha) if the respective item were
deleted.

Clearly, items 5 and 6 "stick out," in that they are not consistent with the rest of
the scale. Their correlations with the sum scale are .05 and .12, respectively,
while all other items correlate at .45 or better.

In the right-most column, we can see that the reliability of the scale would be
about .82 if either of the two items were to be deleted. Thus, we would probably
delete the two items from this scale.
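Two of the three columns discussed above (the item-total correlation without the respective item, and alpha if the item is deleted) can be computed with a short sketch like the following. The function names and the small data set are hypothetical. The squared multiple correlation is left out because it requires a regression of each item on all others.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    item_variances = [statistics.variance(item) for item in zip(*scores)]
    sum_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / sum_variance)

def item_total_statistics(scores):
    """Per item: correlation with the sum of the other items, and alpha if deleted."""
    results = []
    for i in range(len(scores[0])):
        rest = [row[:i] + row[i + 1:] for row in scores]  # scale without item i
        results.append({
            "item": i + 1,
            "item_total_corr": pearson([row[i] for row in scores],
                                       [sum(row) for row in rest]),
            "alpha_if_deleted": cronbach_alpha(rest),
        })
    return results

# Hypothetical data: items 1 to 3 agree, item 4 is mostly noise.
data = [[1, 1, 1, 3], [2, 2, 2, 1], [3, 3, 3, 4], [4, 4, 4, 2], [5, 5, 5, 3]]
for row in item_total_statistics(data):
    print(row["item"], round(row["item_total_corr"], 2), round(row["alpha_if_deleted"], 2))
```

In this toy data set the fourth item plays the role of items 5 and 6 in the Statistica output: its item-total correlation is low and alpha rises when it is removed.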


Step 4: Returning to Step 1. After deleting all items that are not consistent with the
scale, we may not be left with enough items to make up an overall reliable scale
(remember that, the fewer items, the less reliable the scale). In practice, one often
goes through several rounds of generating items and eliminating items, until one
arrives at a final set that makes up a reliable scale.
A Few Commands:
SAS:
    PROC CORR ALPHA NOMISS;
        VAR VAR1-VARn;
    RUN;
SPSS:
    RELIABILITY
      /VARIABLES=q1 q2 q3 q4.
    CORRELATIONS command:
    CORRELATIONS VARIABLES=q1 q2 q3 q4.
STATA:
    alpha var1-varn
STATISTICA: (Assignment!)
Exercise: Download the files samplealpha.sd2 and samplealpha.sav from our
website and try some reliability analysis in SAS and SPSS.
Guide to Interpretation
Reliability     Interpretation
.90 and above   Excellent reliability; at the level of the best standardized tests
.80 - .90       Very good for a classroom test
.70 - .80       Good for a classroom test; in the range of most. There are probably a few items
                which could be improved.
.60 - .70       Somewhat low. This test needs to be supplemented by other measures (e.g., more
                tests) to determine grades. There are probably some items which could be
                improved.
.50 - .60       Suggests need for revision of test, unless it is quite short (ten or fewer items). The
                test definitely needs to be supplemented by other measures (e.g., more tests) for
                grading.
.50 or below    Questionable reliability. This test should not contribute heavily to the course
                grade, and it needs revision.


http://www.arts.auckland.ac.nz/edu/staff/
