Background and Purpose. The aim of this study was to examine 5 commonly
used questionnaires for assessing disability in people with low back pain. The
modified Oswestry Disability Questionnaire, the Quebec Back Pain Disability
Scale, the Roland-Morris Disability Questionnaire, the Waddell Disability
Index, and the physical health scales of the Medical Outcomes Study 36-Item
Short-Form Health Survey (SF-36) were compared in patients undergoing
physical therapy for low back pain. Subjects and Methods. Patients with low
back pain completed the questionnaires during initial consultation with a
physical therapist and again 6 weeks later (n=106). Test-retest reliability was
examined for a group of 47 subjects who were classified as unchanged and
a subgroup of 16 subjects who were self-rated as about the same. Responsiveness was compared using standardized response means, receiver operating
characteristic curves, and the proportions of subjects who changed by at least
as much as the minimum detectable change (MDC) (90% confidence interval
[CI] of the standard error for repeated measures). Scale width was judged as
adequate if no more than 15% of the subjects had initial scores at the upper
or lower end of the scale that were insufficient to allow change to be reliably
detected. Results. Intraclass correlation coefficients (2,1) calculated to measure reliability for the subjects who were classified as unchanged and those
who were self-rated as about the same were greater than .80 for the Oswestry
and Quebec questionnaires and the SF-36 Physical Functioning scale and less
than .80 for the Waddell and Roland-Morris questionnaires and the SF-36
Role Limitations-Physical and Bodily Pain scales. None of the scales were
more responsive than any other. Discussion and Conclusion. Measurements
obtained with the modified Oswestry Disability Questionnaire, the SF-36
Physical Functioning scale, and the Quebec Back Pain Disability Scale were
the most reliable and had sufficient scale width to reliably detect improvement
or worsening in most subjects. The reliability of measurements obtained
with the Waddell Disability Index was moderate, but the scale appeared to
be insufficient to recommend it for clinical application. The Roland-Morris Disability Questionnaire and the Role Limitations-Physical and
Bodily Pain scales of the SF-36 appeared to lack sufficient reliability and
scale width for clinical application. [Davidson M, Keating JL. A comparison
of five low back disability questionnaires: reliability and responsiveness.
Phys Ther. 2002;82:8-24.]
M Davidson, PT, BAppSc, is Lecturer, School of Physiotherapy, La Trobe University, Bundoora, 3053, Melbourne, Australia
(M.Davidson@latrobe.edu.au). Address all correspondence to Ms Davidson.
JL Keating, PT, PhD, is Lecturer, School of Physiotherapy, La Trobe University.
Ms Davidson provided concept/research design, writing, data collection and analysis, and project management. Dr Keating provided consultation
(including review of manuscript before submission).
This study was approved by the Human Ethics Committee of La Trobe University.
This article was submitted October 18, 2000, and was accepted June 15, 2001.
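Table 1.
Characteristics of the Oswestry Disability Questionnaire,8,9 Quebec Back Pain Disability Scale,10 Roland-Morris Disability Questionnaire,11 Waddell Disability Index,12 and Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36) Physical Functioning, Role Limitations-Physical, and Bodily Pain Scales13,14

| Questionnaire | Reference Period(a) | No. of Items in Scale | No. of Response Options | Score Range | Better Function Indicated by |
|---|---|---|---|---|---|
| Oswestry Disability Questionnaire | Not specified | 10 | | 0-100 | Lower scores |
| Quebec Back Pain Disability Scale | Today | 20 | | 0-100 | Lower scores |
| Roland-Morris Disability Questionnaire | Today | 24 | | 0-24 | Lower scores |
| Waddell Disability Index | | | | 0-9 | Lower scores |
| SF-36 Physical Functioning scale | Now | 10 | | 0-100 | Higher scores |
| SF-36 Role Limitations-Physical scale | Past 4 wk | | | 0-100 | Higher scores |
| SF-36 Bodily Pain scale | Past 4 wk | | 5 and 6 | 0-100 | Higher scores |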
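$$\mathrm{SEM} = SD_{av} \times \sqrt{1 - R}$$

$$\mathrm{SEM}_{repeat} = \sqrt{2} \times \mathrm{SEM}$$

$$\mathrm{MDC} = \mathrm{SEM}_{repeat} \times 1.64$$

where 1.64 is the tabled z value. This calculation can be interpreted as the magnitude of change, expressed in scale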
points, required to be 90% confident that the observed
change reflects real change and not just measurement
error.17
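As an illustration of the arithmetic, the following minimal sketch chains the three steps (SEM, SEMrepeat, MDC). The function names and the input values (an average SD of 16 scale points and an ICC of .84) are hypothetical and chosen only for the example; they are not taken from the study data.

```python
import math


def sem(sd_av: float, icc: float) -> float:
    """Standard error of measurement: SD_av * sqrt(1 - R)."""
    return sd_av * math.sqrt(1.0 - icc)


def sem_repeat(sem_value: float) -> float:
    """Standard error for repeated measures: sqrt(2) * SEM."""
    return math.sqrt(2.0) * sem_value


def mdc90(sem_value: float, z: float = 1.64) -> float:
    """Minimum detectable change at 90% confidence: z * SEM_repeat."""
    return z * sem_repeat(sem_value)


# Hypothetical inputs: average SD of 16 points, ICC (2,1) of .84.
example_sem = sem(16.0, 0.84)
example_mdc = mdc90(example_sem)
print(round(example_sem, 1), round(example_mdc, 1))
```

Under these assumed inputs, a score would need to change by roughly 15 points before the change could be distinguished from measurement error with 90% confidence.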
Unless subjects score far enough onto the scale to allow
change by at least as much as the MDC, there is
insufficient scale width to reliably detect change over
time. To evaluate scale width, we calculated for each
questionnaire the proportion of the 140 subjects who
returned the initial questionnaire who did not register
an initial score that would allow at least that amount of
improvement or worsening to be registered at follow-up.
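The scale-width check described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name, the `lower_is_better` flag, and the sample scores are assumptions.

```python
def insufficient_width(initial_scores, mdc, scale_min, scale_max,
                       lower_is_better=True):
    """Return the proportions of subjects whose initial score leaves
    less than one MDC of room to (improve, worsen) on the scale."""
    n = len(initial_scores)
    room_down = [s - scale_min for s in initial_scores]  # room toward the minimum
    room_up = [scale_max - s for s in initial_scores]    # room toward the maximum
    if lower_is_better:
        improve_room, worsen_room = room_down, room_up
    else:
        improve_room, worsen_room = room_up, room_down
    no_improve = sum(r < mdc for r in improve_room) / n
    no_worsen = sum(r < mdc for r in worsen_room) / n
    return no_improve, no_worsen


# Hypothetical initial scores on a 0-100 scale where lower scores are better.
scores = [5, 20, 50, 90, 100]
no_improve, no_worsen = insufficient_width(scores, mdc=15, scale_min=0, scale_max=100)
print(no_improve, no_worsen)
```

A proportion above 15% in either direction would, by the criterion used in this study, indicate inadequate scale width.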
Responsiveness was quantified in 3 ways. We used one
distribution-based method (standardized response
means [SRMs]), one criterion-based method (receiver
operating characteristic [ROC] curves), and a method
that counted the proportion of subjects who changed by
at least as much as the MDC. The SRM was calculated by
dividing the mean change by the standard deviation of
change scores.10,20,27,37 We chose the SRM because a
method of testing the significance of observed differences in SRMs has been described by Liang et al.37
Confidence intervals were constructed using the jackknife method detailed by Liang et al,37 and a paired t
test was used to compare the estimated population SRMs
derived by this method.27,37 Rather than compare the
SRMs for questionnaires using every possible pair-wise
comparison, we limited the number of comparisons by
comparing the highest and lowest SRMs until nonsignificant comparisons occurred.
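A minimal sketch of the SRM calculation, mean change divided by the standard deviation of the change scores, is given below. The use of the sample (n - 1) standard deviation and the example scores are assumptions made for illustration.

```python
from statistics import mean, stdev


def standardized_response_mean(pre, post):
    """SRM = mean(change) / SD(change), with change = post - pre."""
    changes = [b - a for a, b in zip(pre, post)]
    return mean(changes) / stdev(changes)  # stdev uses the n - 1 denominator


# Hypothetical pretest and posttest scores for 4 subjects.
pre_scores = [40, 35, 50, 45]
post_scores = [30, 30, 35, 40]
example_srm = standardized_response_mean(pre_scores, post_scores)
print(round(example_srm, 2))
```

The sign of the SRM depends on the scoring direction; for a scale on which lower scores indicate better function, improvement yields a negative value.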
Criterion-based methods of evaluating responsiveness
require that a judgment be made as to whether clinically
meaningful change has or has not occurred.27,28 In this
study, subjects were classified as having improved by an
important amount if they rated their back problem as
"completely gone," "much better," or "better" at posttest
and as unchanged if they reported being "a little
better," "about the same," or "a little worse." Receiver
operating characteristic curve analysis was performed
using Accuroc Version 2.0. The area under the ROC
curve reflects the ability of the test to discriminate
between subjects who have improved from subjects who
are unchanged.23,27 A value of 1 for the area under the
curve represents perfect (100%) accuracy, whereas a
value of .50 represents chance alone. Accuroc uses a
chi-square statistic to compare ROC curves for different
questionnaires. Even without the Bonferroni adjustments for the multiple post hoc comparisons, there were
no observed differences in area under the ROC curves
among the different instruments. The 95% CIs of the
Accumetric Corp, 1650 Cedar Ave, Montreal, Quebec, Canada H3G 1A4.
$$SE_p = \sqrt{\frac{p(1 - p)}{n}} \qquad (3)$$

Table 2.
[Table body not recoverable from the source layout. Rows: Global Change Scale categories, including Much worse (n=2), A little better (n=28), Better (n=20), Much better (n=26), and Completely gone (n=6). Columns: mean (X) and SD for the Oswestry Disability Questionnaire,8,9 Quebec Back Pain Disability Scale,10 Roland-Morris Disability Questionnaire,11 Waddell Disability Index,12 and the SF-36 Physical Functioning, Role Limitations-Physical, and Bodily Pain scales. SF-36 = Medical Outcomes Study 36-Item Short-Form Health Survey.13,14]

Results
Of 284 patients with a complaint of low back pain, 226
met the eligibility criteria to participate in the study, and
207 (92%) agreed to participate. One hundred forty
participants (68%) returned the first set of questionnaires, and 106 participants (51%) returned the
follow-up package 6 weeks later. Five subjects who completed both sets of questionnaires failed to complete the
global change scale. The time taken to return the
questionnaires at both pretest and posttest was a median
of 8 days. There was no difference in age or sex between
subjects who returned both sets of questionnaires and
those who returned only the first set.
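Table 3.
Sample Characteristics of Unchanged and Improved Groups

| Variable | Unchanged (n=47) No. | Unchanged % | Improved (n=52) No. | Improved % |
|---|---|---|---|---|
| Age (y) | | | | |
| 18-30 | 4 | 8.5 | 6 | 11.5 |
| 31-40 | 6 | 12.8 | 12 | 23.1 |
| 41-50 | 14 | 29.8 | 10 | 19.2 |
| 51-60 | 4 | 8.5 | 11 | 21.2 |
| 61-70 | 9 | 19.1 | 5 | 9.6 |
| ≥71 | 10 | 21.3 | 8 | 15.4 |
| Sex | | | | |
| Male | 17 | 36.2 | 14 | 26.9 |
| Female | 30 | 63.8 | 38 | 73.1 |
| Work situation | | | | |
| Employed | 14 | 29.8 | 24 | 46.1 |
| Unemployed | 5 | 10.6 | 3 | 5.8 |
| Not in the labor force | 28 | 59.6 | 25 | 48.1 |
| Receiving compensation | | | | |
| Yes | 2 | 4.3 | 7 | 13.5 |
| No | 45 | 95.7 | 45 | 86.5 |
| Pain location | | | | |
| Back only | 8 | 17.0 | 20 | 38.5 |
| Buttock, groin, or thigh | 20 | 42.6 | 20 | 38.5 |
| Below knee | 19 | 40.4 | 12 | 23.0 |
| Previous episodes | | | | |
| None | 3 | 6.4 | 5 | 9.6 |
| 1-5 | 9 | 19.2 | 20 | 38.5 |
| >5 | 22 | 46.8 | 21 | 40.4 |
| Continuous pain | 13 | 27.6 | 5 | 9.6 |
| Missing | | | 1 | 1.9 |

One further variable appears between "Receiving compensation" and "Pain location"; its name, category labels, and unchanged-group counts are not recoverable from the source layout. Its surviving values are: Unchanged %: 4.2, 21.3, 23.4, 51.1. Improved No. (%): 9 (17.3), 22 (42.2), 10 (19.2), 9 (17.3), 2 (4.0).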
Table 4.
Questionnaire Initial and Follow-up Scores for Subjects Classified as Unchanged and Improved(a)
[Table body not recoverable from the source layout. Columns, given separately for subjects classified as unchanged (n=47) and subjects classified as improved (n=52): initial score (X, SD), follow-up score (X, SD), difference (X, SD), and t-test P value. Rows: one per questionnaire.]
(a) SF-36 = Medical Outcomes Study 36-Item Short-Form Health Survey.13,14 For SF-36, a negative change score indicates improvement due to reverse scoring direction. All questionnaires have a possible score range of 0-100, except for the Roland-Morris Disability Questionnaire (0-24) and the Waddell Disability Index (0-9).
(b) K-S Lilliefors confirms normal distribution of scores.
Table 5.
Test-Retest Reliability (Intraclass Correlation Coefficients [ICC (2,1)]), Standard Error of Measurement (SEM), Standard Error of Repeated Measurement (SEMrepeat), and Minimum Detectable Change (MDC) for Subjects Classified as Unchanged and Subjects Self-Rated as About the Same(a)

Subjects Classified as Unchanged (n=47)

| Questionnaire | ICC (95% CI) | SEM (95% CI) | SEMrepeat (95% CI) | MDC (95% CI) |
|---|---|---|---|---|
| Oswestry Disability Questionnaire8,9 | .84 (.73-.91) | 6 (5-8) | 9 (7-12) | 15 (11-19) |
| Quebec Back Pain Disability Scale10 | .84 (.73-.91) | 8 (6-10) | 11 (8.5-15) | 19 (14-24) |
| Roland-Morris Disability Questionnaire11 | .53 (.29-.71) | 3.7 (2.9-4.6) | 5.2 (4.1-6.4) | 8.6 (6.7-10.6) |
| Waddell Disability Index12 | .74 (.58-.85) | 1.2 (0.9-1.5) | 1.7 (1.3-2.2) | 2.8 (2.1-3.5) |
| SF-36 Physical Functioning scale | .83 (.71-.90) | 10 (7-13) | 14 (10.5-18) | 22 (17-29) |
| SF-36 Role Limitations-Physical scale | .39 (.11-.61) | 28 (23-35) | 40 (32-49) | 66 (53-80) |
| SF-36 Bodily Pain scale | .37 (.09-.59) | 18 (14-21.5) | 25 (20-30) | 41 (33-50) |

Subjects Self-Rated as About the Same (n=16)

| Questionnaire | ICC (95% CI) | SEM (95% CI) | SEMrepeat (95% CI) | MDC (95% CI) |
|---|---|---|---|---|
| Oswestry Disability Questionnaire8,9 | .92 (.79-.97) | 4.5 (3-7) | 6 (4-10) | 10.5 (6-17) |
| Quebec Back Pain Disability Scale10 | .89 (.72-.96) | 7 (4-11) | 9 (6-15) | 15 (9-24) |
| Roland-Morris Disability Questionnaire11 | .42 (.07-.75) | 4.1 (2.7-5.6) | 5.8 (3.8-7.9) | 9.5 (6.3-13) |
| Waddell Disability Index12 | .79 (.51-.92) | 1.1 (0.7-1.6) | 1.5 (0.9-2.3) | 2.5 (1.5-3.8) |
| SF-36 Physical Functioning scale | .91 (.76-.97) | 7 (4-12) | 10 (6-16) | 16 (9-27) |
| SF-36 Role Limitations-Physical scale | .47 (.02-.78) | 27 (17-37) | 38 (24-52) | 62 (40-86) |
| SF-36 Bodily Pain scale | .59 (.15-.83) | 14 (9-21) | 20 (13-20) | 33 (22-48) |

(a) SF-36 = Medical Outcomes Study 36-Item Short-Form Health Survey.13,14 SEM = SD × √(1 - R), where SD is the average standard deviation for pretest and posttest for 106 subjects and R is the ICC (2,1). The MDC is expressed in the same scale units as the questionnaires and is the 90% confidence interval of the error associated with repeated measurements.
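Table 6.
Proportion of Subjects (n=140) With Insufficient Initial Score to Reliably Detect Improvement or Deterioration

| Questionnaire | Improvement, MDC Based on Subjects Classified as Unchanged (n=47) | Deterioration, MDC Based on Subjects Classified as Unchanged (n=47) | Improvement, MDC Based on Subjects Classified as About the Same (n=16) | Deterioration, MDC Based on Subjects Classified as About the Same (n=16) |
|---|---|---|---|---|
| Oswestry Disability Questionnaire | 11% | 0% | 3% | 0% |
| Quebec Back Pain Disability Scale | 19% | 4% | 14% | 1% |
| Roland-Morris Disability Questionnaire11 | 51% | 16% | 51% | 16% |
| Waddell Disability Index | 21% | 20% | 21% | 20% |
| SF-36 Physical Functioning scale | 13% | 15% | 9% | 10% |
| SF-36 Role Limitations-Physical scale | 21% | 87% | 21% | 86% |
| SF-36 Bodily Pain scale | 11% | 54% | 6% | 54% |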
Table 7.
Standardized Response Means (SRM), Receiver Operating Characteristic (ROC) Curves, and the Proportion of the Sample Improved at Least as Much as the Minimum Detectable Change (MDC)(a)

| Questionnaire | SRM (n=106) | 95% CI | ROC (n=99) | 95% CI | Proportion Improved > MDC (n=106), Based on Subjects Classified as Unchanged (n=47)(b) | 95% CI | Proportion Improved > MDC (n=106), Based on Subjects Classified as About the Same (n=16)(c) | 95% CI |
|---|---|---|---|---|---|---|---|---|
| Oswestry Disability Questionnaire8,9 | 0.52 | -0.51 to 1.56 | .78 | .69-.87 | 24% | 16-33 | 30% | 21-39 |
| Quebec Back Pain Disability Scale | 0.49 | -0.47 to 1.44 | .74 | .64-.84 | 23% | 15-31 | 29% | 20-38 |
| Roland-Morris Disability Questionnaire11 | 0.55 | -0.54 to 1.64 | .77 | .68-.87 | 22% | 14-30 | 17% | 10-24 |
| Waddell Disability Index12 | 0.35 | -0.33 to 1.01 | .76 | .67-.86 | 21% | 13-29 | 21% | 13-29 |
| SF-36 Physical Functioning scale | 0.44 | -0.44 to 1.34 | .74 | .64-.84 | 20% | 12-28 | 27% | 18-36 |
| SF-36 Role Limitations-Physical scale | 0.45 | -0.47 to 1.43 | .73 | .64-.83 | 21% | 13-29 | 21% | 13-29 |
| SF-36 Bodily Pain scale | 0.67 | -0.66 to 2.00 | .73 | .63-.84 | 18% | 11-25 | 23% | 15-31 |

(a) SF-36 = Medical Outcomes Study 36-Item Short-Form Health Survey.13,14 95% CI = 95% confidence interval.
(b) Subjects who self-rated their condition as about the same or a little better/worse and who were classified as unchanged.
(c) Subjects who self-rated their condition as about the same after 6 weeks.
moderate, and the 95% CIs were very wide. We chose the
SRM because it is the only distribution-based method for
which a method of hypothesis testing has been
described.27,37 We believe there is considerable opportunity for error in the repeated iterations of Liang and colleagues'
complex SRM procedure.37 The jackknife
procedure used to generate what Liang and colleagues
called "pseudo-values"37 is performed by systematically
dropping each subject's data from the analysis, one subject at a time.
That is, the SRM is recalculated n times with each subject
removed in turn. This results in a population of n SRM
pseudo-values around the sample SRM and provides a
sampling distribution of SRMs from which to estimate a
population SRM. The population SRM and variance are
then estimated from the pseudo-values, and finally a t
test is used to compare the tests. We found that the result
was distorted unless calculations were made to 5 decimal
places.
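The leave-one-out steps described above can be sketched as follows. This is a generic jackknife, assuming the standard pseudo-value formula n × SRM(all) - (n - 1) × SRM(without subject i); it is not a transcription of the procedure in Liang et al,37 and the function names and example change scores are hypothetical.

```python
import math
from statistics import mean, stdev


def srm(changes):
    """SRM = mean change / SD of change scores."""
    return mean(changes) / stdev(changes)


def jackknife_srm(changes):
    """Return (estimated population SRM, its standard error, pseudo-values)."""
    n = len(changes)
    full = srm(changes)
    pseudo = []
    for i in range(n):
        leave_one_out = changes[:i] + changes[i + 1:]  # drop subject i
        pseudo.append(n * full - (n - 1) * srm(leave_one_out))
    return mean(pseudo), stdev(pseudo) / math.sqrt(n), pseudo


def paired_t(pseudo_a, pseudo_b):
    """Paired t statistic comparing two questionnaires' pseudo-values."""
    diffs = [a - b for a, b in zip(pseudo_a, pseudo_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical change scores on two questionnaires for the same 6 subjects.
changes_a = [-10, -5, -15, -5, 0, -20]
changes_b = [-8, -6, -9, -7, -2, -12]
est_a, se_a, pseudo_a = jackknife_srm(changes_a)
est_b, se_b, pseudo_b = jackknife_srm(changes_b)
t_stat = paired_t(pseudo_a, pseudo_b)
```

Because each pseudo-value is a difference of two large, nearly equal terms, rounding intermediate SRMs too early can distort the result, which is consistent with the need for 5 decimal places noted above.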
The area under the ROC curve has a possible range from
.50, indicating a chance finding, to 1.0, indicating perfect ability of change scores to discriminate between
changed and unchanged patients. The ROC point estimate in our study fell within a narrow range from .73 to
.78, and there was no difference among the scores from
the questionnaires, suggesting that all of the tests were
equivalent in responsiveness. The ROC values of .78 and
.77 that we obtained for the Oswestry and Roland-Morris
questionnaires are almost identical to those reported by
Stratford and colleagues21 (.78 and .79). Beurskens et
al20 reported a similar ROC value for the Oswestry
questionnaire (.76), but a higher value for the Roland-Morris questionnaire (.93).
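For readers unfamiliar with the statistic, the area under the ROC curve equals the probability that a randomly chosen improved subject shows a larger change score than a randomly chosen unchanged subject, with ties counted as one half. The sketch below computes it by direct pairwise comparison; it is not the algorithm used by Accuroc, and the function name and data are hypothetical. Change scores are assumed to be coded so that larger values mean more improvement.

```python
def roc_area(improved_changes, unchanged_changes):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for x in improved_changes:
        for y in unchanged_changes:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(improved_changes) * len(unchanged_changes))


# Hypothetical change scores (larger = more improvement).
improved = [20, 15, 5]
unchanged = [0, 5, 10]
area = roc_area(improved, unchanged)
print(round(area, 2))
```

A value of .50 from this function corresponds to chance discrimination and 1.0 to perfect separation of the two groups.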
Criterion-based methods require the sample to be
dichotomized into those subjects who are unchanged
and those who have improved by a certain amount.27,28
The use of patients' self-ratings of overall change as the
criterion of meaningful clinical change has several limitations: the measurements have unknown reliability and
validity; recall of initial states tends to be inflated, which
tends to inflate the perceived magnitude of change; and
the scale is completed at the same time as the follow-up
questionnaires and is therefore not independent.52 In
our study, subjects were asked to complete the rating of
change scale before the questionnaires, and the completion of the questionnaires may have been influenced by
the overall rating. However, because the questionnaires
were administered by mail, we have no way of knowing
the order in which the subjects completed the tasks.
Patient self-ratings, or averages of patient and therapist
ratings of overall change, are commonly used as the
criterion of change because of the valued perspective of
the rater(s) and because the information can be collected easily.
The reliable-change method of evaluating responsiveness counted the number of subjects who changed by at
least as much as the MDC over 6 weeks. Because we had
performed 2 reliability analyses, one for the group
classified as unchanged and one for the smaller subgroup who had rated themselves as about the same, we
had 2 estimates of MDC. In neither case was the proportion different among the questionnaires.
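The reliable-change count and its confidence interval can be sketched as below. The interval uses the standard error of a proportion, sqrt(p(1 - p)/n), with a normal-approximation multiplier of 1.96 for 95% confidence; the multiplier, the function names, and the example count of 26 of 106 subjects are assumptions made for illustration.

```python
import math


def count_improved(improvements, mdc):
    """Number of subjects whose improvement is at least the MDC.
    Improvements are assumed to be coded as positive numbers."""
    return sum(1 for c in improvements if c >= mdc)


def proportion_ci(k, n, z=1.96):
    """Proportion k/n with a normal-approximation confidence interval."""
    p = k / n
    se = math.sqrt(p * (1.0 - p) / n)
    return p, p - z * se, p + z * se


# Hypothetical example: 26 of 106 subjects improved by at least the MDC.
p, lower, upper = proportion_ci(26, 106)
print(round(p, 2), round(lower, 2), round(upper, 2))
```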
In the responsiveness portion of our study, we found that
none of the questionnaires could be shown to be more
or less responsive than any other. Furthermore, it
appears possible for a questionnaire to yield scores with
very poor reliability, but to have reasonable responsiveness. The SF-36 Bodily Pain scale's ICC was lower than
.50, but the scale was comparable in responsiveness to
the other questionnaires. This finding may indicate
either that the questionnaires perform similarly in their
ability to detect change over time or that the responsiveness methods are not able to discriminate between
instruments with low and high responsiveness. The
proliferation of responsiveness measures and debate
concerning methods for determining responsiveness
suggest that the optimal way to quantify this relatively
recently conceptualized psychometric property of tests
has not been described.27,28,48,50 The validity of scores
obtained with a responsiveness index could be demonstrated by testing whether the index is able to discriminate between a test that is known to be responsive and
one that is known not to detect change over time in a
particular clinical population.
We suggest that the choice of a responsiveness index
should be dictated by the purpose for which the index is
being used in this application. If the aim is to quantify
the responsiveness of an outcome measure to be used in
research, then we believe that a distribution-based
method would be most appropriate, as this information
could be used to estimate sample size and statistical
power. Distribution-based methods, however, provide no
information about whether change is clinically meaningful. A criterion-based method may be appropriate where
the purpose is to detect meaningful change in a clinical
setting. Distribution-based methods provide information
analogous to a test of statistical significance, and
criterion-based methods are analogous to a judgment of
clinical significance. The reliable-change method, in our
opinion, provides practical information for clinical
application in that it answers the question, "In what
proportion of my patients is this questionnaire likely to
detect change beyond the amount that can be attributed
to measurement error?" The limitation of this method is
that the MDC may not be known for many questionnaires and clinical tests.
4 weeks, and the Oswestry questionnaire gives no specific time reference. We are unaware of any studies that
have explored this issue, although Fairbank and
Pynsent54 recently reported that patients prefer a format
such as that of the Oswestry questionnaire in which the
time frame "now" is made explicit.
A surprising result in our study was that although 49% of
the subjects said their condition was "better," "much
better," or "completely gone" after 6 weeks, none of the
questionnaires reliably detected change in more than
30% of the subjects (Tab. 7). This result illustrates that
the amount of change in questionnaire scores perceived
by the client to be meaningful may be smaller than the
amount of change required to be statistically 90% confident that score change is not just measurement error
(the MDC). More reliable and responsive methods need
to be developed for measuring activity limitation in
people with low back pain. Perhaps we are currently
overestimating the SEM (and therefore the MDC)
derived from small samples. However, the consequences
of wrongly concluding that a patient with low back pain
either has or has not changed by a measurable amount
based on change in questionnaire scores are unlikely, in
our opinion, to be substantially adverse. If a patient's
status does not change by at least as much as the current
MDC within an expected time-frame, the therapist may
decide to alter some component of the treatment regimen, to refer the patient to another health care professional, or to cease therapy. The clinician faced with
interpreting a change in an individual patient's questionnaire scores will advisedly use a range of outcome
indicators to provide a picture of overall change.
Although we contend that the modified Oswestry Disability Questionnaire, the SF-36 Physical Functioning
scale, and the Quebec Back Pain Disability Scale appear
to be the most useful measures of functional outcome
for people with low back pain, there are practical
considerations that also influence the choice of questionnaire. If a clinician sees few patients with low back
problems and fast processing of results is the primary
consideration, then the Waddell Disability Index may be
appropriate. Therapists in multidisciplinary clinics may
decide that the SF-36 can provide the more comprehensive assessment required for their purposes. Scale content also provides a point of differentiation. For example, the SF-36 does not ask about difficulty sustaining
body positions such as sitting and standing, and the
Oswestry questionnaire does not include difficulty moving between postures such as sit to stand. The Quebec
questionnaire has more content relating to upper-limb
activities (pulling/pushing, throwing/catching, reaching) than the other scales. Notwithstanding a careful
choice of scale, there will always be some individuals who
do not have a sufficient initial score to enable change to
5 Waddell G, Main CJ, Morris EW, et al. Normality and reliability in the
clinical assessment of backache. BMJ. 1982;284:1519-1530.
8 Fairbank JCT, Couper J, Davies JB, O'Brien JP. The Oswestry Low
Back Pain Disability Questionnaire. Physiotherapy. 1980;66:271-273.
9 Baker DJ, Pynsent PB, Fairbank JCT. The Oswestry Disability Index
revisited: its reliability, repeatability, and validity, and a comparison
with the St Thomas Disability Index. In: Roland M, Jenner JR, eds. Back
Pain: New Approaches to Rehabilitation and Education. Manchester, United
Kingdom: Manchester University Press; 1989:174-186.
31 van den Hoogen HJM, Koes BW, van Eijk JTM, et al. On the course
of low back pain in general practice: a one year follow up study. Ann
Rheum Dis. 1998;57:13-19.
37 Liang MH, Fossel AH, Larson MG. Comparisons of five health status
instruments for orthopedic evaluation. Med Care. 1990;28:632-642.
38 Goldie PA, Matyas TA, Evans OM. Deficit and change in gait velocity
during rehabilitation after stroke. Arch Phys Med Rehabil. 1996;77:1074-1082.
39 Bland M. An Introduction to Medical Statistics. 2nd ed. New York, NY:
Oxford University Press; 1995.
40 Coakes SJ, Steed LG. SPSS Version 6.1 Analysis Without Anguish.
Brisbane, Queensland, Australia: John Wiley & Sons; 1997.
41 Hopman WM, Towheed T, Anastassiades T, et al. Canadian normative data for the SF-36 health survey. CMAJ. 2000;163:265-271.
42 Scott KM, Tobias MI, Sarfati D, Haslett SJ. SF-36 health survey
reliability, validity and norms for New Zealand. Aust N Z J Pub Health.
1999;23:401-406.
51 Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York,
NY: Academic Press Inc; 1977.
53 Juniper EF, Guyatt GH, Feeny DH, et al. Measuring quality of life in
childhood asthma. Qual Life Res. 1996;5:35-46.
54 Fairbank JCT, Pynsent PB. The Oswestry Disability Index. Spine.
2000;25:2940-2953.