
Reliability in Evaluating Passive Intervertebral Motion. Carmella Gonnella, Stanley V Paris, and Michael Kutner. PHYS THER. 1982; 62:436-444.

The online version of this article, along with updated information and services, can be found online at: http://ptjournal.apta.org/content/62/4/436

Collections: This article, along with others on similar topics, appears in the following collection(s): Injuries and Conditions: Spine; Tests and Measurements


Downloaded from http://ptjournal.apta.org/ by guest on March 20, 2012

Reliability in Evaluating Passive Intervertebral Motion


CARMELLA GONNELLA, STANLEY V. PARIS, and MICHAEL KUTNER

Reliable measurements are prerequisite to the successful conduct of outcome studies. In a study of the performance of physical therapists (n = 5) in evaluating passive mobility of the vertebral column with normal subjects (n = 5), several sources of measurement variability were assessed: the reliability within and between therapists, the criteria for grading, and the subjects themselves. Intratherapist reliability was found to be dependable; intertherapist reliability was not. Problems that merit further study were identified as idiosyncratic behaviors that may develop with experience, subject characteristics, and the instrument itself. Periodic assessment of the reliability of therapists in performing evaluations is recommended because of its importance to therapeutic programing. Key Words: Test reliability, Evaluation studies, Spine.

Interest in evaluating the therapeutic effectiveness of a particular program prompted this study on the reliability of physical therapists in administering a specific evaluative procedure for passive mobility of the vertebral column with an evaluation protocol currently in use. The validity of evaluations (ie, the representativeness of the score given to the actual status of what is being measured) depends in part on the consistency of one's performance in repeated measures (intratherapist reliability). Consistency with one's colleagues using the same instrument under similar conditions (intertherapist reliability) is also important. Hence, before questions about the effectiveness of treatments can be answered, intratherapist and intertherapist reliability in administering specific tools should be known, whether the purposes are for purely clinical reasons (ie, to develop and to account for a therapeutic program), or for research.

Dr. Gonnella is Associate Director and Research Director, Emory University Rehabilitation Research and Training Center, and Associate Professor, Department of Rehabilitation Medicine and Graduate Programs in Physical Therapy, Emory University School of Medicine, 1441 Clifton Rd NE, Atlanta, GA 30322 (USA). Mr. Paris is Director, Atlanta Back Clinic and President, Institute of Graduate Health Sciences, Atlanta, GA. Dr. Kutner is Professor of Biometry and Statistics, Emory University School of Medicine, Atlanta, GA. This project was funded in part by a joint grant #1969/001 between the Atlanta Back Clinic and the Institute of Graduate Health Sciences, Atlanta, GA. This article was submitted April 11, 1980, and accepted July 10, 1981.

REVIEW OF LITERATURE

Physical therapists have long appreciated the importance of reliability of measurements1; however, few studies have appeared in the literature of physical therapy. The common finding across these studies that intratherapist reliability is acceptably high is reassuring. This intratherapist consistency has served therapists well in their responsibility to assess patient performance, even though establishing and checking the reliability of therapists periodically has never been, to our knowledge, a routine procedure in the clinic or during basic professional preparation. Before concluding prematurely that such a practice is, therefore, unnecessary, recall that consistency does not necessarily imply accuracy of measurement. Some therapists may grade performances consistently high or consistently low. Such individual biases may affect the treatment program that is developed.

Reliability of measurements is a particular problem for physical therapy research. The potential sources of inconsistencies, as identified in our literature and that of others, are several. The patient may be "technically difficult" to assess,2 subjects may fluctuate in their responses regardless of their cooperativeness and the skill of the examiner,3 or the motion of some joints may be more difficult to measure even though both the instrument and examiner have been shown to be reliable.4, 5 The instrument itself may be a source of inconsistency as Hellebrandt et al demonstrated in comparing several goniometers.4 Hamilton and Lachenbruch, however, did not find significant variation among the three instruments they tested.6

Many of the measurement tools in physical therapy are semiquantitative in nature, such as the rating scale, a commonly used instrument. Nevertheless, the rating scale can be a very useful tool in itself, given certain conditions such as similar training of examiners and standardized protocol. Also, the rating scale can be an effective intermediate step in the development of more precise tools. The number assigned to an observation, however, is often defined by qualitative descriptors subject to different interpretations and applications. For example, what is meant by Grade 2, defined in a test of mobility as a "slight restriction"?

Another potential major source of variability in the measurement process is the therapist. When the process depends primarily upon the skill of the therapist, factors such as experience, training, fatigue, and other aspects of the personal equation7 may influence performance and, thus, consistency in measurement.

Most studies of reliability have been in the measurement of joint motion in the extremities and of muscle strength. Few studies exist on reliability in evaluating passive mobility of the vertebral column. Information in such evaluations is obtained primarily by palpation and by the perceived movement of joints that have very little excursion and that are not amenable to standard goniometric measurements.

Physical therapists in other countries have been long interested in the examination and treatment of back problems with manipulative techniques.8 These techniques are permeating the practice of physical therapy in the United States. Presumably, a patient can become symptom-free or experience considerable reduction in discomfort when normal mobility of the spine is restored and maintained. In evaluating mobility between two adjacent vertebrae, the patient is instructed to relax while the physical therapist passively moves and palpates the segment being tested (Figs. 1 and 2). Therapists who have been trained in the techniques of specific mobility tests and who use them daily are convinced of their reliability and of the effectiveness of related maneuvers in the treatment regimen itself.
To confirm this clinical impression, Kaltenborn and Lindahl assessed the reliability of 10 instructors in examining 4 subjects attending a course on mobilization.8 Intratherapist reliability was not assessed. Kaltenborn tested each subject before and after the other 9 instructors. These 9 instructors varied in qualifications and training in manipulation therapy. Criterion for the accuracy of evaluation and consistency among the instructors was agreement with Kaltenborn, whose grading set the standard, by at least 7 of the 10 instructors, including Kaltenborn. A rating scale was used in which one of four conditions was noted: 1 = no movement; 2 = hypomobility; 3 = normal movement; and 4 = hypermobility. Movement was tested between the occiput and atlas, between four thoracic vertebrae, and between the caudal lumbar vertebrae and sacrum. Each instructor graded 13 movements.

Volume 62 / Number 4, April 1982

Fig. 1. Forward bending testing middle of lumbar part of the spine: therapist with middle finger palpates the forward bending produced by the position of the patient's leg.

Four of the 10 instructors were in complete agreement, 2 disagreed on two grades, a third instructor disagreed on three grades. The remaining 3 instructors disagreed on four to five grades. The results were considered by the authors as "remarkably good." Although not noted by Kaltenborn and Lindahl in their tabular information, the "errors" or misses of the 3 instructors who performed less well occurred predominantly with two of the four subjects and in testing movements in the sagittal plane in the thoracic area.8 The deviations in grading that occurred were only one grade of movement on the four-grade scale and were primarily between the grades of normal and hypomobility. Our evaluation of this study, which showed that 7 instructors were in close agreement and 3 were not, is less than "remarkably good."

Fig. 2. Rotation testing upper thoracic part of spine: therapist's thumb is placed between two adjacent spinous processes to palpate movement as the head is passively turned.

The results of Kaltenborn and Lindahl8 are not exactly comparable to findings noted by Iddings et al.2 A comparison of their studies shows percentages of agreement among the therapists2 and instructors8 are similar, but differences exist in the number of gradations of the scale (13 in the study of Iddings et al2 vs 4 in that of Kaltenborn and Lindahl8), in the test sample (patients2 vs normal subjects8), and, of course, in the evaluation procedure assessed.

In the few studies reported in the osteopathic literature, the focus apparently shifted from interexaminer reliability9 to the need to distinguish between stable and unstable cues.10 In his 1976 study, Johnston identified variables in procedure that would improve percentages of agreement between an experienced examiner and two student examiners.9 For example, the experienced examiner marked the skins of patients at the level he based his findings to ensure that the student examiners were observing the same area. Examiners, using three types of procedures in motion testing (respiratory excursion of the patient and two translatory motions) and attending to cues from resistance to the motions, indicated the presence or absence of findings by the grade P (positive) or N (negative). With changes in the protocol, Johnston reported an improvement of average agreement to 54 percent; agreement of the students with the experienced examiner was 75 to 71 percent. The overall agreement (54%) is low compared with studies in reliability in general and with those specifically reported here. Johnston suggests the use of interexaminer reliability,10 as he describes it, with a statistical model to document the correlation of "distorted segmental behavior"11 with specific disease processes.

While reliability in evaluating mobility of the spine through palpatory cues is common to the studies of Johnston9, 10 and Kaltenborn and Lindahl,8 the examining procedures, orientation, and methods of study differ, and thus, a comparative analysis is difficult. The study of reliability in examining segmental behavior during motions as done by osteopaths is complicated by 1) their theoretic perspective that patterns of systemic changes accompany distorted segmental behavior and 2) their related concern with stability of cues. Unstable cues are defined as "those which do not persist in a reliable fashion across the span of several motion testing procedures by a single examiner, or across testing by multiple examiners."10 Presumably, the bases for unstable cues are in "transient neuromusculoskeletal processes" either not yet well organized, not well established in time, or so slight that the examiner is not confident in the observed behavior.11 Interestingly, intraexaminer reliability was not reported (the reason for its absence is unclear); thus, its interaction with stability of cues remains an interesting question.

The research literature on the reliability of therapists in administering procedures that rely most heavily on palpatory cues is sparse; whether we can expect the same degree of acceptable results as in those assessments using more well defined tools (tools primarily external to the examiner, such as goniometers) remains to be determined. The questions about reliability that concerned the investigators in the studies reported above (expressed directly or implied from their findings) also concerned us. These questions may be grouped under two general sources of unreliability: criterion variability and information variability.

In criterion variability, differences in evaluation are examined that may occur because of the examiner or the procedure. In this group, we asked the following:
1. Are therapists consistent in their grading of the same subject in two separate examination sessions?
2. Are therapists trained in a clinical evaluation procedure consistent among themselves in the evaluation procedure?
3. Does a difference in grading occur because of the maneuver itself; for example, does a maneuver such as forward bending show a greater variation in grades than other maneuvers?
A by-product of this investigation would be information about the instrument itself.

In information variability, the differing amounts and types of information presented by the patient's status are examined. Here, we asked the following:
1. Are some patients, as evidenced by the variations in the grades given, more difficult to grade than others?
2. Do some vertebral segments present problems for grading (ie, does one segment show a wider variation in grades than another)?

Grade  Description                               Criteria
0      Ankylosed                                 No detectable movement within the segment. Requires stress film radiology for confirmation.
1      Considerable restriction (hypomobility)   Significant decrease in expected range. Significant resistance to movement.
2      Slight restriction (hypomobility)         Limitation expected in range. Some resistance to movement.
3      Normal                                    Expected range for body type. Uniform movement throughout range.
4      Slight increase (hypermobility)           Some increase expected in range. Less than normal resistance to movement.
5      Considerable increase (hypermobility)     Excessive range (but eventually restricted by capsular and ligamental structure).
6      Unstable                                  Excessive range (as in Grade 5) but without the restraint of capsular and ligamental structure.

Fig. 3. Rating system for evaluating passive mobility of the spine.

METHOD

Pilot Study

A preliminary study was conducted with the clinical rating form currently used by the clinicians who participated in the study. The clinical form is a seven-point ordinal rating scale (0-6); this scale evolved from that used by Kaltenborn and Lindahl,8 Paris,12, 13 and Stoddard14 (Fig. 3). The reference point for clinical use of the scale is the expected normal for the patient when age, body type, and activity level are considered. In practice, therapists often place qualifiers (+ and - signs) after the numeric grade (eg, 2+ or 3-).

An unexpected finding was the lack of complete agreement on the verbal description of the numeric grade among the five physical therapists although they were trained similarly in the procedure and had been using the form in the clinic for several years. Evidently, some idiosyncratic behaviors had developed with clinical experience and with changes in rationale that may evolve with experience. Criteria were, therefore, developed with the therapists in an effort to establish consistency in interpreting the scale. Note that among the criteria identified (Fig. 3) the therapists included a sensing of movement through resistance of the tissue to the palpating finger, a difficult aspect to assess and one that would be influenced by clinical experience. After consensus was reached on the criteria for grading passive motion of the vertebral column and the administrative aspects of the experimental protocol refined, the formal study of reliability of the therapists in the evaluation procedure was conducted.

Experiment

Therapists (n = 5 men) were randomly assigned to subjects (n = 5 women) at two examination sessions. One therapist had approximately 20 years of experience, two therapists each had 3 years of experience; the remaining two had 4 and 5 years of experience. The five subjects were physical therapy students ranging in age from 22 to 27. Criteria for participation were two: absence of both a history of back pain and an endomorphic structure. Restriction of body type was intended to minimize subjects as a source of variance. The segments from T12 to S1 were selected for two reasons: feasibility to conduct the study and the high incidence of low back problems seen by physical therapists. Maneuvers used were forward bending, side bending left and right, and rotation left and right.

Two evaluation sessions were held 13 days apart. A minimum of one week between sessions was considered necessary to allow dissipation of any possible soreness caused by repeated examinations. Each session was conducted in late afternoon and comprised two evaluations of each subject by each therapist. The therapist performed under two conditions within each evaluation session: a "blind" and a "normal" condition. The intent of the blind condition was to eliminate cues that would help the therapist to remember the results of the examination. Such recall could introduce the possibility of spuriously high correlations in performance from one session to the next.

In the blind condition, the therapist's eyesight was occluded with a blackened scuba diving mask. Questions could be asked by the therapist when necessary but only in a format that would require a yes or no answer from the subject. An assistant helped in positioning the subject and recording the grades. The blind condition is artificial because therapists do rely on other information to confirm their evaluations; for example, observing body type and asking questions about activity level. Including this condition, however, permitted the assessment of intratherapist reliability across the two evaluation sessions. The blind condition preceded the normal condition for all subjects at each examination session. The order of the presentation of conditions ("blind" to "normal") was reasoned to be less likely to provide cues to the therapist from one examination session to the other. In the normal condition, the therapist's vision was not occluded; subjects, however, wore a headcover, mask, and a hospital gown that opened from the back.


TABLE 1 Therapists' Scores by Vertebral Level, Subject, and Session in Evaluating Passive Intervertebral Motion: Blind Condition, Forward Bending Maneuver

Vertebral            Session 1, Therapist         Session 2, Therapist
Level     Subject    1    2    3    4    5        1    2    3    4    5
T12-L1    1          3.0  3.0  3.5  3.0  2.5      3.5  2.5  3.0  3.0  3.0
          2          3.5  2.5  1.5  3.5  3.0      3.5  3.0  2.0  3.5  2.5
          3          3.5  3.0  3.5  3.5  3.5      3.5  3.0  3.0  3.0  2.5
          4          3.5  3.0  3.0  3.0  2.5      2.5  3.0  3.0  2.5  2.0
          5          2.0  2.5  2.5  2.5  2.5      3.0  3.0  3.0  1.0  2.5
L1-L2     1          3.0  3.0  3.0  3.0  2.5      3.5  3.0  3.0  3.0  3.0
          2          2.5  2.0  2.5  1.5  2.0      2.0  2.5  3.0  2.0  1.5
          3          3.0  2.5  3.5  2.5  3.0      2.0  3.0  3.0  3.0  2.5
          4          3.0  2.5  2.5  3.0  2.5      2.5  3.0  3.0  3.0  3.0
          5          2.0  2.5  2.5  1.5  1.5      3.0  3.0  2.0  2.0  2.0
L2-L3     1          3.0  2.5  2.5  2.5  3.0      3.0  3.0  3.0  2.5  3.0
          2          2.0  3.0  2.0  2.0  2.5      2.0  2.5  2.5  1.0  2.0
          3          2.5  3.0  3.0  3.0  3.0      2.0  2.5  3.0  3.0  2.5
          4          3.5  2.5  3.0  3.0  2.5      3.5  2.5  3.0  3.0  3.5
          5          2.0  2.0  2.5  1.5  2.0      2.5  2.5  2.0  2.0  2.0
L3-L4     1          2.5  2.5  3.0  3.0  3.0      3.0  2.5  2.5  3.0  2.5
          2          2.0  2.5  2.5  3.0  2.5      2.0  2.5  2.0  2.0  2.5
          3          3.0  2.5  2.0  2.5  3.0      3.0  3.0  2.5  3.0  2.5
          4          3.5  3.0  3.0  3.0  2.5      3.0  2.5  2.5  3.0  3.0
          5          3.0  3.0  2.0  3.0  2.5      3.0  2.5  1.5  2.0  1.5
L4-L5     1          1.0  3.0  2.5  2.0  2.5      3.0  3.0  2.5  2.0  2.5
          2          3.0  2.0  2.5  3.0  2.0      2.5  2.5  2.5  3.0  2.5
          3          3.0  3.0  2.5  3.0  3.0      3.5  2.5  3.0  3.0  3.0
          4          3.5  3.0  3.0  3.0  3.0      4.0  3.0  2.5  3.0  3.0
          5          3.5  2.5  2.0  2.5  1.5      3.0  3.0  1.5  3.0  2.0
L5-S1     1          3.0  2.5  2.0  3.0  3.5      2.0  2.0  2.0  2.5  3.0
          2          2.0  3.0  2.0  3.0  3.0      3.0  3.0  2.5  3.0  3.0
          3          4.0  3.0  2.5  3.0  2.5      3.5  3.0  2.5  3.0  2.5
          4          2.0  2.0  2.5  3.0  2.5      2.0  2.5  2.0  3.0  2.5
          5          3.5  3.0  2.5  2.0  2.5      4.0  3.0  1.5  2.0  2.5

Analysis
The plus-minus qualifiers to the rating system were converted to the midpoint of the numeric scale units, after a discussion with the participating therapists. For example, a 3- or a 2+ would be recorded as a 2.5. Therapists determined, based on their experience, that a segment rated as 3- or 2+ would be similarly treated. In addition, the therapists identified a one-unit difference between the ratings as the limit of reliability. Any greater deviation could mean a difference in the clinical decision in treatment. Thus, a one-unit deviation (for example, 2 to 2+) as a criterion for a clinically acceptable agreement was used to determine the reliability of performance. Grade 3 (normal) was the reference point for deviations. The grading scale is not continuous and thus precluded the use of standard correlational analysis. The data were summarized descriptively by means and standard deviations.
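The conversion and the agreement criterion described above can be illustrated with a short sketch. It is not the authors' procedure as implemented; the function names `grade_to_numeric` and `clinically_acceptable` are hypothetical, and the sketch assumes that one "unit" equals half a grade (the step from 2 to 2+ given in the text).

```python
def grade_to_numeric(grade: str) -> float:
    """Convert a clinical grade such as '2', '2+', or '3-' to a number.

    Assumption: '+' and '-' shift the whole grade by half a point, so that
    '2+' and '3-' both become 2.5, as in the example given in the text.
    """
    text = grade.strip()
    offset = 0.0
    if text.endswith("+"):
        offset, text = 0.5, text[:-1]
    elif text.endswith("-"):
        offset, text = -0.5, text[:-1]
    base = int(text)  # raises ValueError for anything that is not a whole grade
    if not 0 <= base <= 6:
        raise ValueError(f"grade outside the 0-6 scale: {grade!r}")
    return base + offset


def clinically_acceptable(first: str, second: str, unit: float = 0.5) -> bool:
    """Return True when two grades differ by no more than one unit.

    Assumption: one unit is 0.5 on the numeric scale (eg, 2 to 2+).
    """
    return abs(grade_to_numeric(first) - grade_to_numeric(second)) <= unit


if __name__ == "__main__":
    for pair in [("2", "2+"), ("3-", "2+"), ("2", "3")]:
        print(pair, clinically_acceptable(*pair))
```

Under these assumptions, a pair such as 2 and 2+ counts as agreement, whereas 2 and 3 does not. A different unit size can be tried by passing another value for `unit`.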

RESULTS

As in the pilot study, the range of scores was truncated; that is, out of the seven-point scale (0-6) examiners used only the grades of 1-4 (Tab. 1). In a random sample of patient evaluations, a similar restricted use of the scale was also found.

Intratherapist consistency between two evaluation sessions, between blind and normal conditions, and among maneuvers, was reasonably good. As one example of the raw data supporting this finding between sessions, therapist reliability by sessions and subjects for the blind condition and the maneuver of forward bending are shown in Table 1. In most instances, each therapist agreed with himself within a one-unit deviation from the first session (Day 1) to the second session (Day 2). Results were averaged over the sessions, conditions, and maneuvers (Tab. 2).

In comparison with each other and using the Grade 3 reference point, consistency among the therapists was not demonstrated (Tabs. 2 and 3). Although each therapist was consistent with himself, each showed a particular bias or set in his measurements (Tabs. 2 and 3). For example, Therapist 2 averaged the second closest to Grade 3 and generally exhibited the least variability. Therapist 3 generally averaged the farthest from Grade 3 and was also more variable than Therapist 2. Therapist 1 averaged the closest to Grade 3 but was the most variable.

Subject variability was evident with Subjects 2 and 5 exhibiting the greatest deviations from Grade 3 and also more variability than the other three subjects (Tabs. 2 and 3). The average of the grades received by the subjects was generally below 3. According to the grading system, most of the deviations recorded by the therapists were for hypomobility. In only a few instances did subjects receive a grade for hypermobility, and these scores were given primarily by Therapist 1 (Tab. 1).

Notable differences in grading because of the maneuver itself (ie, a wider variation in some maneuvers over others) were not evident. Variability in deviations by spinal level was apparent, however. Variability was lowest for L1-L2 and L2-L3 and highest for L5-S1 (Tabs. 2 and 3).
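As an illustration of how between-session agreement and bias relative to Grade 3 might be tallied, the following sketch uses only the T12-L1 grades from Table 1 (blind condition, forward bending). The grades were transcribed on the assumption that the table lists each therapist's column from Subject 1 to Subject 5. The helper names and the 0.5 unit size are assumptions made for the example; the sketch covers a small excerpt and does not reproduce the averages in Tables 2 and 3.

```python
from statistics import mean

# Table 1 excerpt, T12-L1: grades for Subjects 1-5, keyed by therapist number.
SESSION_1 = {
    1: [3.0, 3.5, 3.5, 3.5, 2.0],
    2: [3.0, 2.5, 3.0, 3.0, 2.5],
    3: [3.5, 1.5, 3.5, 3.0, 2.5],
    4: [3.0, 3.5, 3.5, 3.0, 2.5],
    5: [2.5, 3.0, 3.5, 2.5, 2.5],
}
SESSION_2 = {
    1: [3.5, 3.5, 3.5, 2.5, 3.0],
    2: [2.5, 3.0, 3.0, 3.0, 3.0],
    3: [3.0, 2.0, 3.0, 3.0, 3.0],
    4: [3.0, 3.5, 3.0, 2.5, 1.0],
    5: [3.0, 2.5, 2.5, 2.0, 2.5],
}


def agreement_rate(first, second, unit=0.5):
    """Share of paired grades that differ by no more than one unit.

    Assumption: one unit is 0.5 on the numeric scale.
    """
    pairs = list(zip(first, second))
    hits = sum(1 for a, b in pairs if abs(a - b) <= unit)
    return hits / len(pairs)


def mean_deviation_from_normal(scores, normal=3.0):
    """Average signed distance from Grade 3; negative values lean to hypomobility."""
    return mean(scores) - normal


if __name__ == "__main__":
    for therapist in sorted(SESSION_1):
        first, second = SESSION_1[therapist], SESSION_2[therapist]
        rate = agreement_rate(first, second)
        bias = mean_deviation_from_normal(first + second)
        print(f"Therapist {therapist}: agreement {rate:.0%}, mean deviation {bias:+.2f}")
```

The agreement rate addresses intratherapist consistency across sessions, while the signed mean deviation gives a rough view of the individual bias or set discussed above.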

TABLE 2 Therapists' Scores by Vertebral Level and Subject (N = 20)

Vertebral            Therapist 1    Therapist 2    Therapist 3    Therapist 4    Therapist 5    Mean of
Level     Subject    Mean   s       Mean   s       Mean   s       Mean   s       Mean   s       Means
T12-L1    S1         3.23   .41     2.93   .18     3.05   .22     3.00   .00     2.95   .28     3.03
          S2         3.33   .59     2.85   .24     1.53   .11     3.45   .15     2.83   .36     2.80
          S3         3.33   .80     2.95   .15     3.20   .25     3.28   .30     3.03   .34     3.16
          S4         2.90   .42     2.95   .15     2.98   .11     2.95   .32     2.75   .41     2.91
          S5         2.58   .47     2.78   .26     2.50   .46     2.55   .67     2.25   .34     2.53
          Mean       3.07           2.89           2.65           3.05           2.76           2.88
L1-L2     S1         3.00   .28     2.90   .21     2.95   .15     3.00   .00     2.80   .30     2.93
          S2         2.05   .32     2.78   .30     2.95   .15     1.65   .40     2.00   .51     2.29
          S3         2.48   .47     2.83   .24     3.00   .32     2.95   .15     2.83   .49     2.82
          S4         3.00   .23     2.85   .24     2.95   .32     3.00   .00     2.75   .26     2.91
          S5         2.55   .39     2.95   .15     2.40   .31     2.03   .41     2.15   .40     2.42
          Mean       2.62           2.86           2.85           2.53           2.51           2.67
L2-L3     S1         2.78   .44     2.75   .26     2.70   .30     2.80   .30     2.60   .35     2.73
          S2         2.18   .41     2.63   .22     1.98   .44     2.03   .47     2.18   .24     2.20
          S3         2.30   .38     2.73   .26     2.85   .29     2.85   .29     2.48   .38     2.64
          S4         2.98   .41     2.85   .24     2.93   .18     2.95   .22     2.85   .37     2.91
          S5         2.78   .38     2.78   .30     2.08   .41     1.85   .37     2.05   .36     2.31
          Mean       2.60           2.75           2.51           2.50           2.43           2.56
L3-L4     S1         2.85   .76     2.65   .33     2.05   .51     2.35   .59     2.40   .35     2.46
          S2         2.40   .31     2.58   .24     1.75   .34     2.20   .44     2.25   .30     2.40
          S3         2.68   .49     2.73   .26     2.38   .36     2.28   .47     2.23   .41     2.46
          S4         3.15   .40     2.73   .26     2.53   .44     2.98   .11     2.63   .31     2.80
          S5         3.15   .29     2.73   .26     1.65   .29     2.48   .50     1.93   .34     2.39
          Mean       2.85           2.68           2.07           2.46           2.28           2.47
L4-L5     S1         2.45   .63     2.68   .29     2.73   .26     2.18   .41     2.40   .21     2.49
          S2         2.78   .41     2.53   .30     2.15   .33     2.83   .37     2.15   .29     2.49
          S3         3.20   .38     2.58   .29     2.60   .35     2.60   .50     2.60   .31     2.72
          S4         3.28   .47     2.83   .29     2.15   .46     2.43   .47     2.58   .44     2.65
          S5         3.68   .37     2.68   .29     1.50   .16     2.88   .28     2.28   .30     2.60
          Mean       3.08           2.66           2.23           2.58           2.40           2.59
L5-S1     S1         3.00   .36     2.60   .35     2.23   .34     2.98   .11     2.63   .36     2.69
          S2         3.30   .66     2.75   .30     1.63   .32     3.00   .00     2.68   .24     2.67
          S3         3.50   .43     2.80   .25     2.33   .34     2.95   .22     2.68   .24     2.85
          S4         2.83   .54     2.63   .32     2.03   .44     3.00   .00     2.35   .29     2.57
          S5         3.65   .40     2.78   .26     1.30   .38     2.43   .47     2.45   .28     2.52
          Mean       3.26           2.71           1.90           2.87           2.56           2.66

TABLE 3 Means and Standard Deviations of Grades by Therapists, Subjects, and Vertebral Levels

Variable                     Mean   s
Therapist (n = 600)
  1                          2.91   .61
  2                          2.76   .28
  3                          2.37   .62
  4                          2.66   .55
  5                          2.49   .44
Subject (n = 600)
  1                          2.71   .45
  2                          2.44   .61
  3                          2.77   .48
  4                          2.79   .43
  5                          2.46   .64
Vertebral level (n = 500)
  T12-L1                     2.88   .53
  L1-L2                      2.67   .49
  L2-L3                      2.56   .48
  L3-L4                      2.47   .54
  L4-L5                      2.59   .55
  L5-S1                      2.66   .61

DISCUSSION

This discussion follows the order of the questions asked, and the presentation of results, with one exception. Comments on the evaluation instrument will be presented last; our primary purpose was the assessment of the reliability of therapists with an instrument already in use.

Are therapists consistent in evaluating passive mobility of the vertebral column of the same subject in two separate examinations?

Our finding that intratherapist reliability was acceptable and less variable than intertherapist reliability supports the results of other studies.2, 4-6 Given these repeated findings across studies, acceptable intratherapist reliability is a reasonable assumption, once a skill is learned and a prescribed protocol is followed. We would expect also sufficient reliability of a graduating student in administering evaluation procedures taught at a specific level of preparation. These assumptions should be tested because of the importance of reliable evaluations for therapeutic programing. We suggest that the need to monitor intratherapist reliability routinely would depend upon the clinical implications of any bias evident in a therapist's performance (eg, high or low scores), the accuracy of the standard against which measurements are compared, the influence of experience or variable practice on the skill of using a measuring tool, and the level of precision of the instrument itself.

Are therapists trained in the evaluation procedure consistent among themselves in evaluating passive mobility of the vertebral column?

Our results on intertherapist reliability were disappointing and thought provoking. Comparison with other studies is difficult because of differences in protocol. Nevertheless, we consider our results not to have demonstrated comparable reproducibility. Partial explanations for this lack of reproducibility are the degree of precision of the tool used and the standards available for comparison. For example, there is less room for variation in applying a goniometer and reading the value from the goniometer than in a procedure that relies on palpatory cues and visual judgments on very small movements. An evaluation tool that is based on objectively measurable events will usually be more reliable than one in which the event is described in more qualitative terms, subject then to differences in interpretation and the possible effects of experience.

The inconsistency found was surprising because all examiners were similarly trained, although at different times. But the inconsistency appears no less than that demonstrated by the instructors in the Kaltenborn and Lindahl study.8 A direct comparison of our studies is not possible, however, because the scales differed (16 points, including the qualifiers of + and -, vs a 4-point scale8) and the size of the unit differed. The scale used in our study contained more gradations; for example, two values for hypomobility (excluding the qualifiers + and -) versus one value.8 The unit size deviation considered acceptable by our therapists was also smaller than that used by Kaltenborn and Lindahl.8

Unlike Kaltenborn and Lindahl,8 we did not use the grading of the most experienced therapist as a standard for comparison. Although Therapist 1 developed the clinical form and trained the therapists in its use at various times, at the time of the study this therapist was more involved in teaching and administration and, therefore, was not treating patients routinely as were the other therapists. Experience with regular practice was assumed by us to be as important for consistency in performing any evaluative technique as it is for performing a motor skill.

Why then was the performance among therapists variable? The only explanation we can offer is the possibility that the more experienced therapists have developed some idiosyncratic behaviors as a consequence of a freedom and opportunity they were given to study other philosophies and procedures in mobilization of the back. The two therapists (4 and 5) who agreed more with each other than with the remaining therapists had, at the time, relatively less experience (including teaching mobilization, a responsibility of the therapists in this clinic). Evidently, the agreement after the pilot study on the qualitative operational definitions for each grade level was not enough to control for the potentially high subjectivity of palpation and observation of movements at joints with very little excursion.

Another possible explanation for the variability was offered by one of us (SVP), who states that the clinical evaluation scale as currently designed is considered by some to be an excellent tool for inexperienced therapists, allowing them to record the mechanical characteristics of the segments being moved. The experienced clinician, however, tends, to a greater extent, to incorporate into his evaluation procedure other palpatory cues such as resistance to movement, character of the movement, and even the texture of the structure being palpated. Also suggested (SVP) is a tendency of more experienced clinicians to grade an observation lower than it may be in order to bring the dysfunction to their attention clinically.

These speculations of behaviors, personalized with experience, will be assessed in a forthcoming effort to increase consistency of performance among the therapists. A training program has been proposed in which the therapists will together evaluate the mobility of the vertebral column of several patients, discuss their differences at the time of the evaluations, modify the descriptors in the rating form if necessary, and reconsider the palpatory techniques being used.

What is a normal back?


In addition to the specific questions about the reliability of therapists in evaluating passive mobility of vertebral segments and potential sources of unreliability, two other aspects deserve brief comment: the use of "normal" subjects and the instrument itself. One might have hypothesized (wrongly) that interrater reliability would be very high because the subjects were "normal" and deviations from normal should have been few. As both the Kaltenborn and Lindahl study8 and ours showed, however, the normal back in these young adults is not "normal" in the sense of being so graded. Various degrees of hypomobility were recorded in both studies. This finding raises the interesting question, What is a normal back? At this stage of our concern about the reliability of the evaluation procedure (and its validity), normal subjects were more appropriate than patients. Consider: each subject was evaluated completely in two separate sessions for Day 1 and for Day 2, with six segments being examined through five different maneuvers at least three times for each maneuver by five physical therapists. Our design strategy would not have been feasible with patients; in addition, patients as subjects are always a concern because changes may occur in behaviors being assessed simply as a result of the testing procedure itself. Nevertheless, we believe development of more practical ways to assess reliability clinically and scientifically is a worthwhile and necessary endeavor.

Are some patients more difficult to grade than others?


That some patients/subjects can be more difficult to assess than others has already been documented in the literature,2 and made evident in the Kaltenborn and Lindahl study (though not discussed by them).8 This range in difficulty was demonstrated also in this study, in spite of our efforts to control for body type. The therapists recalled that one of our difficult subjects tended towards obesity. Unfortunately, our subjects had graduated before the data analysis was completed. In addition to those with obesity, persons with spinal anomalies or who cannot relax have been identified as presenting difficulties for evaluation of the back (SVP). In any subsequent study with normal subjects, one suggestion for controlling for body type would be the use of calipers to assess the amount of body tissue.

Are some vertebral segments more difficult to grade than others?


The variability found in L5-S1, which was greater than in the remaining segments, can be attributed to two sources: anatomy and the testing procedure itself. Anatomically, the disproportionate space between the spinous processes of these two vertebrae makes this area more difficult to palpate, the spinous process of S1 is not as pronounced as L5, and L5-S1 is also subject to more anomalies than other segments. In the lumbar spine, all vertebrae have a similar range of movement in forward and backward bending. The test position for forward and backward bending of L5-S1, in which patients lie on their side with hips and knees flexed, places the L5-S1 in a forward bending position. Thus, this position decreases the available range in forward bending and L5-S1 then is inclined to feel tighter to some therapists, accounting possibly for the variation in the results obtained at this segment. Variability because of the test procedure can be reduced, and future efforts will be directed toward that objective.

Why a seven-point scale?


Aside from the need to reassess the operational definitions of the numeric scale of the clinical form and possibly the validity of some of the procedures, one might also ask about the need for a seven-point (0-6) scale, excluding the plus-minus qualifiers. Recall that only values of 1 through 4 were recorded, with clustering from 2 to 3.5, and that a random sampling of patient evaluations produced a similar pattern. The other values (0, 5, and 6) are still necessary, although not used frequently. The value of zero (ankylosed) would be recorded for patients with congenital fusion, surgical fusion, and fusion following tuberculosis or other bony infection. Grade 5 (hypermobility) is not uncommon above a fusion site or in rheumatoid arthritis in the cervical part of the spine. Grade 6 (unstable) is extremely rare but when present denotes a need for immediate immobilization or surgery; for example, fracture of the odontoid process of the axis and rupture of associated ligaments following severe trauma in a motor vehicle accident. The range of the scale, then, is considered appropriate for clinical purposes. With further clarification of the meaning of each value on the scale and of the application of the evaluation techniques, interrater reliability should improve. While the clinical evaluation form may become more "sensitive" as a result, the precision achievable is limited by the nature of the scale. There is a point at which any greater precision in an instrument evaluating behavior becomes meaningless. The degree of precision necessary should be dictated by its clinical helpfulness and reliability in application.
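The link between the precision of a scale and the agreement that raters can reach on it can be illustrated with a minimal sketch. It computes the mean pairwise percent agreement among raters twice: once requiring identical grades and once accepting grades within half a step (standing in for a plus-minus qualifier). The grades, rater labels, function names, and the choice of percent agreement as the statistic are all assumptions made for illustration; they are not the data or the analysis of this study.

```python
from itertools import combinations

# Hypothetical grades on the 0-6 scale for six segments of one subject.
# Half steps stand in for plus-minus qualifiers. These numbers are invented
# for illustration only; they are NOT data from this study.
grades = {
    "therapist_A": [2.0, 2.5, 3.0, 3.0, 2.5, 3.5],
    "therapist_B": [2.5, 2.5, 3.0, 2.0, 3.0, 3.5],
    "therapist_C": [2.0, 3.0, 3.5, 3.0, 2.5, 3.0],
}


def percent_agreement(a, b, tolerance=0.0):
    """Share of paired grades that differ by no more than `tolerance`."""
    if len(a) != len(b):
        raise ValueError("grade lists must have equal length")
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return hits / len(a)


def mean_pairwise_agreement(grades_by_rater, tolerance=0.0):
    """Average percent agreement over every pair of raters."""
    pairs = list(combinations(grades_by_rater.values(), 2))
    return sum(percent_agreement(a, b, tolerance) for a, b in pairs) / len(pairs)


# Agreement when grades must match exactly versus within half a grade.
exact = mean_pairwise_agreement(grades)
within_half = mean_pairwise_agreement(grades, tolerance=0.5)
print(f"exact agreement: {exact:.2f}, within half a grade: {within_half:.2f}")
```

In this invented example, agreement rises sharply once a half-grade difference is tolerated, which is one way to see why finer distinctions on the form do not by themselves yield more dependable measurements. Percent agreement ignores chance agreement, so a formal reliability study would likely prefer a chance-corrected or variance-based statistic.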

In summary, this study was intended as a precursor to one in which the effectiveness of various treatment approaches would be assessed. Such a study is not precluded by our results because intratherapist reliability for testing passive intervertebral motion was demonstrated to be acceptable. The disappointing results on intertherapist reliability do mean that in any such study the therapist as a source of variation would have to be held constant (eg, all the evaluations might be done by one therapist). A better solution in the long run than keeping the therapist variable constant would be to develop acceptable intertherapist reliability. One advantage would be the availability of sufficient clinical data (retrospectively or prospectively) to help clarify the clinical recovery course of various conditions and to determine the effectiveness and timeliness of therapeutic interventions. Several reasons have been suggested for the biases demonstrated in intratherapist reliability and for the inadequate intertherapist reliability. In a follow-up study that will incorporate a training program, we expect these results to improve.

Acknowledgments. We wish to thank the physical therapy students, Georgia State University (1979), and the physical therapists, Atlanta Back Clinic, who participated in this study.

REFERENCES
1. Hewitt D: The range of active motion at the wrist of women. J Bone Joint Surg 26:775-787, 1928
2. Iddings DM, Smith LK, Spencer WA: Muscle testing: Part 2. Reliability in clinical use. Phys Ther Rev 41:249-256, 1961
3. Cobe HM: The range of active motion at the wrist of white adults. J Bone Joint Surg 26:763-774, 1928
4. Hellebrandt FA, Duvall EN, Moore ML: The measurement of joint motion. Part III. Reliability of goniometry. Phys Ther Rev 29:302-307, 1949
5. Boone DC, Azen SP, Lin CM, et al: Reliability of goniometric measurements. Phys Ther 58:1355-1360, 1978
6. Hamilton GF, Lachenbruch PA: Reliability of goniometers in assessing finger joint angle. Phys Ther 49:465-469, 1969
7. Boring EG: History, Psychology and Science. New York, NY, John Wiley and Sons Inc, 1963
8. Kaltenborn F, Lindahl O: Reproducibility of the results of manual mobility testing of specific intervertebral segments. Swedish Medical Journal (Lakartidningen) 66:962-965, 1969
9. Johnston WL: Interexaminer reliability in palpation. Journal of American Osteopathic Association 76:286-287, 1976
10. Johnston WL, Elkess ML, Marino RV: A statistical model for evaluating stability of palpatory cues. Journal of American Osteopathic Association 77:473-474, 1978
11. Johnston WL: Segmental behavior during motion: III. Extending behavioral boundaries. Journal of American Osteopathic Association 72:462-474, 1973
12. Paris SV: The Spinal Lesion. Christchurch, New Zealand, Pegasus Press Ltd, 1964
13. Paris SV: Gross spinal movements and their restriction as the basis of joint manipulation. Proceedings of the Sixth International Congress of the World Confederation for Physical Therapy Association. Amsterdam, The Netherlands, Van Gorcum BV, 1971, pp 208-213
14. Stoddard A: Manual of Osteopathic Technique. London, England, Hutchinson Medical Publications, 1959
