The ordinary Pearson correlation coefficient, as well as the nonparametric correlation coefficients,
assume that each x-y pair of numbers is independent from the other x-y pairs. Both the
interpretation of the correlation coefficient itself, as well as the significance test for the coefficient,
require this assumption.
Bland and Altman published three one-page articles discussing this situation.
In their first paper (Bland and Altman, 1994), they simply present the idea that it is incorrect to
compute a correlation coefficient on data that uses more than one x-y pair per study subject (the
repeated measurement situation).
In their second paper (Bland and Altman, 1995a), they describe how to analyze such data if the
interest is a within subjects research question. For example, suppose you measured pH and
Paco2 in the same patient, and you repeated this approximately four times during their hospital
stay. You have a within subjects research question if you want to know whether an increase in
pH within the same individual was associated with an increase in Paco2. (Paco2 is the symbol for
partial pressure of carbon dioxide in the arterial blood.)
In their third paper (Bland and Altman, 1995b), they describe how to analyze such data if the
interest is a between subjects research question. For this same dataset, you have a between
subjects research question if you want to know whether subjects with high values of pH also
tend to have high values of Paco2.
In this chapter, both Bland and Altman's within subjects approach and their between subjects
approach are described, along with some Stata programs for doing them.
_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
We will need some Stata code for computing the significance test for the correlation coefficient
when we get to the between subjects analysis below. Stata does not have a command where
you can supply a correlation coefficient and a sample size and get back the p value.
The hypothesis test that the population correlation coefficient is different from zero is (Rosner,
2006, p.496):
test statistic: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$, where r is the sample correlation coefficient
Thus, for a two-sided $\alpha = 0.05$ comparison, statistical significance (p < 0.05) is achieved
when
$|t| > t_{n-2,\,0.975}$
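The test statistic above can be programmed in a few lines. The following is only a minimal sketch, not necessarily the program the author used: the name pearsonpval and the argument order (r first, then n) are taken from the way the command is called later in this chapter, and everything else is an assumption.

```stata
* Sketch (assumed implementation): p value for a Pearson r, given r and n
capture program drop pearsonpval
program define pearsonpval
    version 10
    args r n
    * t statistic with n-2 degrees of freedom
    local t = (`r')*sqrt(`n'-2)/sqrt(1-(`r')^2)
    * two-sided p value from the t distribution
    local p = 2*ttail(`n'-2,abs(`t'))
    display as result "r = " `r'
    display as result "n = " `n'
    display as result "p = " %6.4f `p'
end
* syntax: pearsonpval r n
pearsonpval .25 100
```

The parentheses around `r' matter: in Stata, exponentiation binds more tightly than negation, so a negative r would otherwise be squared incorrectly.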
r = .25
n = 100
p = 0.0121
This result agrees with Rosner (2006, p.497), so you can feel confident it is programmed
correctly.
As stated above, Bland and Altman (1995a) describe how to analyze repeated measures data if the
interest is a within subjects research question. For example, you have a within subjects
research question if you want to know whether an increase in pH within the same individual was
associated with an increase in Paco2.
This section will work through the process described in Bland and Altman (1995a).
Bland and Altman (1995a) illustrate with a dataset they provide in their article (their Table 1). To
read this dataset into Stata,
File
Open
Find the directory where you copied the course CD:
Change to the subdirectory datasets & do-files
Single click on bland&altmanBMJ1995Table1.dta
Open
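The same thing can be done from the command line. This is a sketch that assumes the current working directory is already the datasets & do-files subdirectory; adjust the path to wherever you copied the file.

```stata
* Assumes the working directory is the "datasets & do-files" folder
use "bland&altmanBMJ1995Table1.dta", clear
* List the data with a separator line between subjects
list, sepby(subject)
```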
+------------------------+
| subject ph paco2 |
|------------------------|
1. | 1 6.68 3.97 |
2. | 1 6.53 4.12 |
3. | 1 6.43 4.09 |
4. | 1 6.33 3.97 |
|------------------------|
5. | 2 6.85 5.27 |
6. | 2 7.06 5.37 |
7. | 2 7.13 5.41 |
8. | 2 7.17 5.44 |
|------------------------|
9. | 3 7.4 5.67 |
10. | 3 7.42 3.64 |
11. | 3 7.41 4.32 |
12. | 3 7.37 4.73 |
13. | 3 7.34 4.96 |
14. | 3 7.35 5.04 |
15. | 3 7.28 5.22 |
16. | 3 7.3 4.82 |
17. | 3 7.34 5.07 |
+------------------------+
Suppose we are interested in a within subjects research question, where we want to know
whether an increase in pH within the same individual was associated with an increase in Paco2.
This can be done with analysis of covariance (ANCOVA). We make one of the two variables the
outcome variable and the other the continuous predictor variable. In addition, we make subject a
predictor variable, by using indicator variables (7 indicator, or dummy, variables to represent the 8
subjects). The anova command automatically creates these indicator variables, without showing
them to you, for any variable that is not listed in the continuous( ) option.
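A command along the following lines fits this model. It is a sketch that mirrors the anova call inside the withincorr program given later in this chapter, with ph chosen as the outcome; the version prefix is an assumption added so that the continuous( ) option is accepted by newer releases of Stata.

```stata
* Stata 10 syntax, matching the continuous( ) option described above
version 10: anova ph paco2 subject, continuous(paco2)
* In Stata 11 or later, the same model can be requested with
* factor-variable notation instead:
* anova ph c.paco2 subject
```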
This ANCOVA table shows how the variability in pH can be partitioned into components due to
different sources.
To see what the model is fitting, we can save the predicted values,
predict ph_pred
and then overlay line graphs through these predicted values for each subject onto the scatterplot
of observations,
* scatterplot of the observations, with one fitted line per subject
#delimit ;
twoway (scatter ph paco2 , msymbol(circle) mcolor(black))
(line ph_pred paco2 if subject==1, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==2, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==3, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==4, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==5, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==6, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==7, lcolor(black) lpattern(solid))
(line ph_pred paco2 if subject==8, lcolor(black) lpattern(solid))
, legend(off) ytitle("Intramural pH") xtitle(PaCO2)
;
#delimit cr
[Figure: scatterplot of Intramural pH (y-axis, about 6.4 to 7.4) against PaCO2 (x-axis, 3 to 7), with the fitted line for each subject overlaid]
We see that the model fit parallel lines through each subject's data.
This term is analogous to a coefficient of determination. When you square a Pearson correlation
coefficient, you have the coefficient of determination (r2) which represents the proportion of
variation in the outcome variable that is explained by the predictor variable.
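To obtain the magnitude of the correlation coefficient within subjects, we take the square root
of this proportion, where 0.1153 is the sum of squares for paco2 and 0.3337 is the residual sum of
squares:
$\sqrt{\dfrac{0.1153}{0.1153 + 0.3337}} = 0.51$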
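The arithmetic can be checked in Stata, and the sign of the coefficient can be read from the equivalent regression with subject indicator variables. This is a sketch: the xi: prefix is assumed from the _Isubject_ variable names in the regression output that follows.

```stata
* Magnitude of the within subjects correlation from the two sums of squares
display %4.2f sqrt(0.1153/(0.1153 + 0.3337))
* Same model as the ANCOVA, fitted as a regression with subject indicators;
* the sign of the paco2 slope gives the sign of the correlation
xi: regress ph paco2 i.subject
```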
------------------------------------------------------------------------------
ph | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
paco2 | -.108323 .0298928 -3.62 0.001 -.1688378 -.0478082
_Isubject_2 | .7046112 .0773549 9.11 0.000 .5480145 .861208
_Isubject_3 | .9500127 .0610954 15.55 0.000 .8263315 1.073694
_Isubject_4 | .9715577 .0735091 13.22 0.000 .8227464 1.120369
_Isubject_5 | .8603818 .0583954 14.73 0.000 .7421664 .9785972
_Isubject_6 | .9264284 .0659945 14.04 0.000 .7928296 1.060027
_Isubject_7 | .6921055 .1049093 6.60 0.000 .4797277 .9044834
_Isubject_8 | .7033361 .0615714 11.42 0.000 .5786913 .8279809
_cons | 6.929854 .129469 53.53 0.000 6.667758 7.19195
------------------------------------------------------------------------------
From the coefficient for paco2 (B = -.108323), we see the association is negative.
The p value for the correlation coefficient can be taken from the t test for the regression slope
(p = 0.001) or from the F test for the paco2 term in the ANCOVA table (p = 0.0008).
Number of obs = 47 R-squared = 0.8993
Root MSE = .093715 Adj R-squared = 0.8781
It makes no difference which variable we choose for the dependent variable, as either way gives
an identical correlation coefficient and p value.
* ---------------------------------------
* Bland and Altman (1995)
* correlation coefficient within subjects
capture program drop withincorr
program define withincorr
    version 10
    syntax varlist(min=3 max=3) [if] [in]
    tokenize `varlist'
    local yvar `1'
    local xvar `2'
    local subjectid `3'
    * temporary 0/1 variable flagging the observations selected by if/in
    tempvar touse
    mark `touse' `if' `in'
    preserve
    quietly keep if `touse'
    * ANCOVA: magnitude of r from the sums of squares
    quietly anova `yvar' `xvar' `subjectid' ///
        , continuous(`xvar')
    local r = sqrt(e(ss_1)/(e(ss_1)+e(rss)))
    * regression with subject indicators: sign of r and p value from the slope
    quietly xi: regress `yvar' `xvar' i.`subjectid'
    if _b[`xvar']<0 {
        local r = -`r'
    }
    * residual degrees of freedom, so p matches the t test for the slope
    local p = 2*ttail(e(df_r),abs(_b[`xvar']/_se[`xvar']))
    display as result _newline "within r = " ///
        %4.2f `r' " , n = " `e(N)' " , p = " %5.4f `p'
    restore
end
* syntax: withincorr y x subjectid
* ----------------------------------------
After executing the above lines to set up the withincorr command for the duration of your
current Stata session, to compute a within subjects correlation coefficient, you run the following
command,
Syntax: withincorr yvar xvar subjectid [the subject ID variable must come last]
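For the Bland and Altman dataset, the call would look like the following sketch, which assumes the Table 1 data (variables ph, paco2 and subject) are in memory and the withincorr program has been defined in the current session.

```stata
* outcome first, predictor second, subject ID last
withincorr ph paco2 subject
```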
This is what is found in the Bland and Altman (1995a) paper, so you can feel confident it was
programmed correctly.
The command also works with the if option and in option. To illustrate,
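Two possible calls are sketched here. The subject cutoff and the observation range are arbitrary values chosen only to show the syntax; they are not taken from the source.

```stata
* Restrict to a subset of subjects (hypothetical cutoff, for illustration only)
withincorr ph paco2 subject if subject<=4
* Restrict to a range of observations (hypothetical range, for illustration only)
withincorr ph paco2 subject in 1/20
```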
Article Suggestion
Statistical Methods
Results
In the Results section, just report it like you would any other correlation coefficient. I
doubt the reader would want to be bothered with knowing which correlation coefficients
are ordinary coefficients and which are within subjects coefficients when reading the
results.
In their third paper (Bland and Altman, 1995b), the authors describe how to obtain a correlation
coefficient for repeated measures data if the interest is a between subjects research question.
Using the same dataset provided for their within subjects correlation, you have a between
subjects research question if you want to know whether subjects with high values of pH also
tend to have high values of Paco2.
Bland and Altman (1995b) illustrate with a dataset they provided in their second article (Bland
and Altman, 1995a).
+------------------------+
| subject ph paco2 |
|------------------------|
1. | 1 6.68 3.97 |
2. | 1 6.53 4.12 |
3. | 1 6.43 4.09 |
4. | 1 6.33 3.97 |
|------------------------|
5. | 2 6.85 5.27 |
6. | 2 7.06 5.37 |
7. | 2 7.13 5.41 |
8. | 2 7.17 5.44 |
|------------------------|
9. | 3 7.4 5.67 |
10. | 3 7.42 3.64 |
11. | 3 7.41 4.32 |
12. | 3 7.37 4.73 |
13. | 3 7.34 4.96 |
14. | 3 7.35 5.04 |
15. | 3 7.28 5.22 |
16. | 3 7.3 4.82 |
17. | 3 7.34 5.07 |
We see that each subject has multiple observations of pH and Paco2, with a variable number of
observations per subject.
For the between subjects correlation, we first convert these data to a mean value for each subject,
keeping track of how many observations were originally available per subject.
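One way to do this is sketched below, following the collapse step used inside the betweencorr program given later in this chapter. The variable name number matches the listing that follows; the approach assumes the long-format Table 1 data are in memory.

```stata
* Put a 1 on every row, then sum it within subject to count observations
generate number = 1
* One row per subject: mean ph, mean paco2, and the number of observations
collapse (mean) ph paco2 (sum) number, by(subject)
list
```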
+----------------------------------------+
| subject ph paco2 number |
|----------------------------------------|
1. | 1 6.4925 4.0375 4 |
2. | 2 7.0525 5.3725 4 |
3. | 3 7.356667 4.83 9 |
4. | 4 7.326 5.312 5 |
5. | 5 7.31375 4.39875 8 |
|----------------------------------------|
6. | 6 7.323333 4.92 6 |
7. | 7 6.906667 6.603333 3 |
8. | 8 7.115 4.78375 8 |
+----------------------------------------+
The dataset is now reduced to 8 lines, which represent the mean ph and mean paco2 for each
subject, along with the original number of observations per subject, which we call number. The
number variable will be used to compute a weighted correlation coefficient.
This dataset is what is found in the third Bland and Altman paper (Bland and Altman, 1995b),
except they use only two decimal places. So that we get the same answer as they did, let's first
round to two decimal places.
replace ph=round(ph,.01)
replace paco2=round(paco2,.01)
list
+---------------------------------+
| subject ph paco2 number |
|---------------------------------|
1. | 1 6.49 4.04 4 |
2. | 2 7.05 5.37 4 |
3. | 3 7.36 4.83 9 |
4. | 4 7.33 5.31 5 |
5. | 5 7.31 4.4 8 |
|---------------------------------|
6. | 6 7.32 4.92 6 |
7. | 7 6.91 6.6 3 |
8. | 8 7.11 4.78 8 |
+---------------------------------+
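The weighted correlation coefficient can then be requested with frequency weights. The command below is a sketch: the choice of pwcorr with the obs and sig options is an assumption inferred from the layout of the output that follows (coefficient, p value, and number of observations), while the frequency weighting matches the corr call inside the betweencorr program given later.

```stata
* Pearson correlation of the subject means, weighted by observations per subject
pwcorr ph paco2 [fweight=number], obs sig
```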
             |       ph    paco2
-------------+------------------
          ph |   1.0000
             |
             |       47
             |
       paco2 |   0.0765   1.0000
             |   0.6095
             |       47       47
As pointed out by Bland and Altman, we cannot use this p value. The p value was based on n =
47 independent observations. We only have n = 8 independent observations. That is, we reduced
our correlated repeated measurements into a single observation per subject, which resulted in 8
independent observations.
To compute the p value, we use the pearsonpval program described earlier in this chapter, where
the significance test for a correlation coefficient was introduced. Running those lines in
the do-file editor sets it up for us.
pearsonpval .0765 8
r = .0765
n = 8
p = 0.8571
This is what Bland and Altman reported in their paper (r = .08 , p = 0.9).
* ---------------------------------------
* Bland and Altman (1995)
* correlation coefficient between subjects
capture program drop betweencorr
program define betweencorr
    version 10
    syntax varlist(min=3 max=3) [if] [in]
    tokenize `varlist'
    local yvar `1'
    local xvar `2'
    local subjectid `3'
    * temporary 0/1 variable flagging the observations selected by if/in
    tempvar touse
    mark `touse' `if' `in'
    preserve
    quietly keep if `touse'
    quietly drop if `subjectid'==.
    * reduce to one row per subject: means, plus number of observations
    tempvar number
    gen `number'=1
    collapse (mean) `yvar' `xvar' (sum) `number' , by(`subjectid')
    * correlation of subject means, weighted by observations per subject
    quietly corr `yvar' `xvar' [fw=`number'] // returns r(rho)
    local r=r(rho)
    * p value based on the number of subjects, not the number of observations
    quietly count if `yvar'~=. & `xvar'~=. // returns r(N)
    * parentheses around r: in Stata, ^ binds more tightly than negation
    local t=(`r')*sqrt(r(N)-2)/sqrt(1-(`r')^2)
    local p=2*ttail(r(N)-2,abs(`t'))
    display as result _newline "between r = " ///
        %4.2f `r' " , n = " `r(N)' " , p = " %5.4f `p'
    restore
end
* syntax: betweencorr y x subjectid
* ----------------------------------------
After executing the above lines to set up the betweencorr command for the duration of your
current Stata session, to compute a between subjects correlation coefficient, you run the following
command,
Syntax: betweencorr yvar xvar subjectid [the subject ID variable must come last]
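For the Bland and Altman dataset, the call would look like the following sketch. It assumes the original long-format Table 1 file is in the current working directory; the data are reloaded first because the earlier steps reduced the dataset in memory to one row per subject.

```stata
* Reload the original repeated measures data (one row per observation)
use "bland&altmanBMJ1995Table1.dta", clear
* outcome first, predictor second, subject ID last
betweencorr ph paco2 subject
```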
This is what is found in the Bland and Altman (1995b) paper, so you can feel confident it was
programmed correctly. (We get r=0.07, instead of 0.08 this time because we did not round the
data off to two decimals to begin with.)
Article Suggestion
Statistical Methods
In the Results section, just report it like you would any other correlation coefficient. I
doubt the reader would want to be bothered with knowing which correlation coefficients
are ordinary coefficients and which were originally repeated measurements when reading
the results.
Bland JM, Altman DG. (1995a). Calculating correlation coefficients with repeated observations:
Part 1 - correlation within subjects. BMJ 310:446.
Bland JM, Altman DG. (1995b). Calculating correlation coefficients with repeated observations:
Part 2 - correlation between subjects. BMJ 310:633.
Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Thomson Brooks/Cole.