
Interpretation of Somers' D under four simple models

Roger B. Newson 27 May, 2009

Introduction

Somers' D is an ordinal measure of association introduced by Somers (1962)[8]. It can be defined in terms of Kendall's $\tau_a$ (Kendall and Gibbons, 1990)[3]. Given a sequence of bivariate random variables $(X, Y) = \{(X_i, Y_i)\}$, sampled using a sampling scheme for sampling pairs of bivariate pairs, and with identical marginal distributions, Kendall's $\tau_a$ is defined as

$$\tau(X, Y) = E\left[\operatorname{sign}(X_i - X_j)\,\operatorname{sign}(Y_i - Y_j)\right] \tag{1}$$

(where $E[\cdot]$ denotes expectation), or, equivalently, as the difference between the probability that the two $(X, Y)$ pairs are concordant and the probability that the two $(X, Y)$ pairs are discordant. A pair of $(X, Y)$ pairs is said to be concordant if the larger $X$ value is paired with the larger $Y$ value, and is said to be discordant if the larger $X$ value is paired with the smaller $Y$ value. Somers' D of $Y$ with respect to $X$ is defined as

$$D(Y|X) = \tau(X, Y) / \tau(X, X) \tag{2}$$

or, equivalently, as the difference between the two conditional probabilities of concordance and discordance, assuming that the 2 $X$ values are unequal. Note that Kendall's $\tau_a$ is symmetric in $X$ and $Y$, whereas Somers' D is asymmetric in $X$ and $Y$.

Somers' D plays a central role in rank statistics, and is the parameter behind most of these nonparametric methods, and can be estimated with confidence limits like other parameters. (See Newson (2002)[6] and Newson (2006)[5] for details.) However, many nonstatisticians appear to have a problem interpreting Somers' D, even though a difference between proportions is arguably a simpler concept than an odds ratio, which many of them claim to understand better.

Parameters are often easier to understand if they play a specific role in a specific model. Fortunately, in a number of simple standard models, Somers' D can be derived from another parameter by a transformation. A confidence interval for Somers' D can therefore be transformed, using inverse endpoint transformation, to give a robust, outlier-resistant confidence interval for the other parameter, assuming that the model is true.

We will discuss 4 simple models for bivariate $(X, Y)$ pairs:

1. Binary X, binary Y. Somers' D is then the difference between proportions.
2. Binary X, continuous Y, constant hazard ratio. Somers' D is then a transformation of the hazard ratio.
3. Binary X, Normal Y, equal variances. Somers' D is then a transformation of the mean difference divided by the common standard deviation (SD).
4. Bivariate Normal X and Y. Somers' D is then a transformation of the Pearson correlation coefficient.

Each of these cases has its own Section, and a Figure illustrating the transformation.
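The sample analogues of definitions (1) and (2) can be sketched in a few lines of Python. This is only a minimal illustration that averages over all pairs of observations; the function names and the small data set are hypothetical, and no confidence limits are computed.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_a(x, y):
    """Sample analogue of (1): mean of sign(xi - xj) * sign(yi - yj) over pairs i < j."""
    pairs = list(combinations(range(len(x)), 2))
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    return total / len(pairs)


def somers_d(y, x):
    """Sample analogue of (2): D(Y|X) = tau_a(X, Y) / tau_a(X, X).

    Assumes that x is not constant, so that tau_a(X, X) is nonzero.
    """
    return kendall_tau_a(x, y) / kendall_tau_a(x, x)


# Hypothetical data: x is a binary group indicator, y an ordinal outcome.
x = [1, 1, 1, 0, 0, 0]
y = [5, 3, 4, 2, 6, 1]
print(kendall_tau_a(x, y))
print(somers_d(y, x))
```

Of the 9 pairs with unequal x values in this example, 6 are concordant and 3 are discordant, so Somers' D is the difference between the conditional proportions 6/9 and 3/9.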

Binary X, binary Y

We assume that there are two subpopulations, Subpopulation A and Subpopulation B, and that $X$ is a binary indicator variable, equal to 1 for observations in Subpopulation A and 0 for observations in Subpopulation B, and that $Y$ is also a binary variable, equal to 1 for successful observations and 0 for failed observations. Define

$$p_A = \Pr(Y = 1 \mid X = 1), \quad p_B = \Pr(Y = 1 \mid X = 0) \tag{3}$$

to be the probabilities of success in Subpopulations A and B, respectively. Then Somers' D is simply

$$D(Y|X) = p_A - p_B, \tag{4}$$

or the difference between the two probabilities of success. Figure 1 gives Somers' D as the trivial identity function of the difference between proportions. Note that Somers' D is expressed on a linear scale of multiples of 1/12, which (as we will see) is arguably a natural scale of reference points for Somers' D.
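The step from definition (2) to formula (4) can be written out as follows. This is a sketch which assumes that the two members of a pair are sampled independently; the symbols $Y_A$ and $Y_B$, denoting the outcomes of a random member of Subpopulation A and of Subpopulation B, are introduced here for illustration only.

```latex
\begin{align*}
D(Y|X) &= \Pr(Y_A > Y_B) - \Pr(Y_A < Y_B) \\
       &= \Pr(Y_A = 1,\, Y_B = 0) - \Pr(Y_A = 0,\, Y_B = 1) \\
       &= p_A (1 - p_B) - (1 - p_A)\, p_B \\
       &= p_A - p_B .
\end{align*}
```

The first line is the difference between the conditional probabilities of concordance and discordance, given that the two $X$ values are unequal; pairs with tied $Y$ values contribute to neither term.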

Binary X, continuous Y, constant hazard ratio

We assume, again, that $X$ indicates membership of Subpopulation A instead of Subpopulation B, and assume, this time, that $Y$ has a continuous distribution in each of the two subpopulations, with cumulative distribution functions $F_A(\cdot)$ and $F_B(\cdot)$, and probability density functions $f_A(\cdot)$ and $f_B(\cdot)$. We imagine $Y$ to be a survival time variable, although we will not consider the possibility of censorship. In the two subpopulations, the survival functions and the hazard functions are given, respectively, by

$$S_k(y) = 1 - F_k(y), \quad h_k(y) = f_k(y)/S_k(y), \tag{5}$$

where $y$ is in the range of $Y$ and $k \in \{A, B\}$. Suppose that the hazard ratio $h_A(y)/h_B(y)$ is constant in $y$, and denote its value as $R$. (This is trivially the case if both subpopulations have an exponential distribution, with $h_k(y) = 1/\mu_k$, where $\mu_k$ is the subpopulation mean. However, it can also be the case if we assume some other distributional families, such as the Gompertz or Weibull families, or even if we do not assume any specific distributional family, but still assume the proportional hazards model of Cox (1972)[1].) Somers' D is then derived as

$$D(Y|X) = \frac{\int_y h_B(y) S_A(y) S_B(y)\,dy - \int_y R\,h_B(y) S_A(y) S_B(y)\,dy}{\int_y h_B(y) S_A(y) S_B(y)\,dy + \int_y R\,h_B(y) S_A(y) S_B(y)\,dy} = \frac{1 - R}{1 + R}. \tag{6}$$

For finite $R$, this formula can be inverted to give the hazard ratio $R$ as a function of Somers' D by

$$R = \frac{1 - D(Y|X)}{1 + D(Y|X)} = \frac{1 - c(Y|X)}{c(Y|X)}, \tag{7}$$

where $c(Y|X) = [D(Y|X) + 1]/2$ is Harrell's c-index, which reparameterizes Somers' D to a probability scale from 0 to 1 (Harrell et al., 1982)[2]. Note that, for continuous $Y$ and binary $X$, Harrell's c is the probability of concordance, and that a constant hazard ratio $R$ is the corresponding odds against concordance.

Figure 2 gives Somers' D as a function of $R$. Note that $R$ is expressed on a log scale, similarly to the standard practice with logistic regression. Somers' D of lifetime with respect to membership of Population A is seen to be a decreasing logistic sigmoid function of the Population A/Population B log hazard ratio, equal to 0 when the log ratio is 0 and the ratio is therefore 1. A hazard ratio of 2 (as typically found when comparing the lifetimes of cigarette smokers as Population A to lifetimes of nonsmokers as Population B) corresponds to a Somers' D of -1/3, or a Harrell's c of 1/3. Similarly, a hazard ratio of 1/2 (as typically found when comparing lifetimes of nonsmokers as Population A to lifetimes of cigarette smokers as Population B) corresponds to a Somers' D of 1/3, or a Harrell's c of 2/3. Therefore, although a smoker may possibly outlive a nonsmoker of the same age, the odds are 2 to 1 against this happening.
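The transformations (6) and (7), and the endpoint transformation of a confidence interval, can be sketched as follows. The function names and the confidence limits in the example are hypothetical, and the sketch assumes that the constant hazard ratio model holds.

```python
def somers_d_from_hazard_ratio(r):
    """Formula (6): D(Y|X) = (1 - R) / (1 + R)."""
    return (1.0 - r) / (1.0 + r)


def hazard_ratio_from_somers_d(d):
    """Formula (7): R = (1 - D) / (1 + D), valid for D > -1."""
    return (1.0 - d) / (1.0 + d)


def harrell_c_from_somers_d(d):
    """Harrell's c = (D + 1) / 2."""
    return (d + 1.0) / 2.0


def hazard_ratio_interval(d_lower, d_upper):
    """Transform confidence limits for D into confidence limits for R.

    The transformation is decreasing, so the endpoints swap places.
    """
    return hazard_ratio_from_somers_d(d_upper), hazard_ratio_from_somers_d(d_lower)


d_smokers = somers_d_from_hazard_ratio(2.0)
print(d_smokers, harrell_c_from_somers_d(d_smokers))
# Hypothetical confidence limits for D, chosen only for illustration.
print(hazard_ratio_interval(-0.5, -0.2))
```

Because Somers' D decreases as the hazard ratio increases, the upper confidence limit for D maps to the lower confidence limit for R, and vice versa.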

Binary X, Normal Y, equal variances

We assume, again, that $X$ indicates membership of Subpopulation A instead of Subpopulation B, and assume, this time, that $Y$ has a Normal distribution in each of the two subpopulations, with respective means $\mu_A$ and $\mu_B$ and standard deviations (SDs) $\sigma_A$ and $\sigma_B$. Then, the probability of concordance (Harrell's c) is the probability that a random member of Population A has a higher $Y$ value than a random member of Population B, or (equivalently) the probability that the difference between these two $Y$ values is positive. This difference has a Normal distribution, with mean $\mu_A - \mu_B$ and variance $\sigma_A^2 + \sigma_B^2$. Somers' D is therefore given by the formula

$$D(Y|X) = 2\,\Phi\!\left(\frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right) - 1, \tag{8}$$

where $\Phi(\cdot)$ is the cumulative standard Normal distribution function. If the two SDs are both equal (to $\sigma = \sigma_A = \sigma_B$), then this formula simplifies to

$$D(Y|X) = 2\,\Phi\!\left(\frac{\mu_A - \mu_B}{\sigma\sqrt{2}}\right) - 1 = 2\,\Phi\!\left(\frac{\delta}{\sqrt{2}}\right) - 1, \tag{9}$$

where $\delta = (\mu_A - \mu_B)/\sigma$ is the difference between the two means, expressed in units of the common SD. The parameter $\delta$ can therefore be defined as a function of Somers' D by the inverse formula

$$\delta = \sqrt{2}\,\Phi^{-1}\!\left(\frac{D(Y|X) + 1}{2}\right), \tag{10}$$

where $\Phi^{-1}(\cdot)$ is the inverse standard Normal cumulative distribution function.

Figure 3 gives Somers' D as a function of the mean difference, expressed in SD units. Again, we see a sigmoid curve, but this time Somers' D is increasing with the alternative parameter. Note that a mean difference of 1 SD corresponds to a Somers' D just above 1/2 (approximately .52049988), corresponding to a concordance probability (or Harrell's c) just above 3/4, whereas a mean difference of -1 SD corresponds to a Somers' D of just below -1/2 (approximately -.52049988), or a Harrell's c just below 1/4. For mean differences between -1 SD and 1 SD, the corresponding Somers' D can be interpolated (approximately) in a linear fashion, with Somers' D approximately equal to half the mean difference in SDs. A mean difference of 2 SDs corresponds to a Somers' D of approximately .84270079 (slightly more than 5/6), or a Harrell's c of approximately .9213504 (slightly over 11/12). A mean difference of 3 SDs corresponds to a Somers' D of approximately .96610515 (slightly more than 19/20), or a Harrell's c of .98305257 (over 49/50). For mean differences over 3 SDs, the probability of discordance falls to fractions of a percent, and Somers' D becomes progressively less useful, as there is very little overlap between subpopulations for Somers' D to measure.

The equal-variance Normal model is also often used as a toy model for the problem of defining diagnostic tests, based on a continuous marker variable $Y$, for membership of Subpopulation A instead of Subpopulation B. In this case, Subpopulation A might be diseased individuals, Subpopulation B might be nondiseased individuals, and $Y$ might be a quantitative test result. If the true distribution of $Y$ in each subpopulation is Normal, with a common subpopulation variance and different subpopulation means, then the log of the likelihood ratio between Subpopulations A and B is a linear function of the $Y$-value, given by the formula

$$\log LR = \frac{\mu_B^2 - \mu_A^2}{2\sigma^2} + \frac{\mu_A - \mu_B}{\sigma^2}\,Y. \tag{11}$$

Note that the intercept term is the value of the log likelihood ratio if $Y = 0$, whereas the slope term is equal to $\delta/\sigma$ in our notation. In Bayesian inference, this log likelihood ratio is added to the log of the prior (pre-test) odds of membership of Subpopulation A instead of Subpopulation B, to derive the log of the posterior (post-test) odds of membership of Subpopulation A instead of Subpopulation B.

The role of Somers' D in diagnostic tests is discussed in Newson (2002)[6]. Briefly, if we plot true positive rate (sensitivity) against false positive rate (1-specificity), and plot points on the graph corresponding to the various possible test thresholds, and join points in ascending order of candidate threshold value, then the resulting curve is known as the sensitivity-specificity curve, or (alternatively) as the receiver-operating characteristic (ROC) curve. Harrell's c is the area below the ROC curve, whereas Somers' D is the difference between the areas below and above the ROC curve. The likelihood ratio is the slope of the ROC curve (defined as the derivative of true positive rate with respect to false positive rate), and, in the equal-variance Normal model, is computed by exponentiating (11). The equal-variance Normal model, predicting the test result from the disease status, is therefore combined with Bayes' theorem to imply a logistic regression model for estimating the disease status from the test result. Figure 3 therefore implies (unsurprisingly) that, other things being equal, the ROC curve becomes higher as the mean difference (in SDs) becomes higher.
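Formulas (9), (10) and (11) can be checked numerically with the Python standard library. The sketch below assumes the equal-variance Normal model; the function names are hypothetical, and `statistics.NormalDist` supplies the standard Normal distribution function and its inverse.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal distribution, mean 0 and SD 1


def somers_d_from_delta(delta):
    """Formula (9): D(Y|X) = 2 * Phi(delta / sqrt(2)) - 1."""
    return 2.0 * STD_NORMAL.cdf(delta / sqrt(2.0)) - 1.0


def delta_from_somers_d(d):
    """Formula (10): delta = sqrt(2) * Phi^{-1}((D + 1) / 2), for -1 < D < 1."""
    return sqrt(2.0) * STD_NORMAL.inv_cdf((d + 1.0) / 2.0)


def log_likelihood_ratio(y, mu_a, mu_b, sigma):
    """Formula (11): intercept plus slope times the test result y."""
    intercept = (mu_b ** 2 - mu_a ** 2) / (2.0 * sigma ** 2)
    slope = (mu_a - mu_b) / sigma ** 2
    return intercept + slope * y


# Somers' D and Harrell's c for mean differences of 1, 2 and 3 SDs.
for delta in (1.0, 2.0, 3.0):
    d = somers_d_from_delta(delta)
    print(delta, round(d, 8), round((d + 1.0) / 2.0, 8))
```

As a consistency check on (11), the log likelihood ratio is zero at the midpoint between the two subpopulation means, where the two Normal densities are equal.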

Bivariate Normal X and Y

We assume, this time, that $X$ and $Y$ are both continuous and jointly bivariate Normal, with means $\mu_X$ and $\mu_Y$, SDs $\sigma_X$ and $\sigma_Y$, and a Pearson correlation coefficient $\rho$. As both $X$ and $Y$ are continuous, Somers' D is equal to Kendall's $\tau_a$, and is given by the formula

$$D(Y|X) = \frac{2}{\pi}\arcsin(\rho), \tag{12}$$

which can be inverted to give

$$\rho = \sin\!\left(\frac{\pi}{2}\,D(Y|X)\right). \tag{13}$$

This relation is known as Greiner's relation. The curve of (12) is illustrated in Figure 4. Note that Pearson correlations of $-1/\sqrt{2}$, $-1/2$, $0$, $1/2$, and $1/\sqrt{2}$ correspond to Kendall correlations (and therefore Somers' D values) of $-1/2$, $-1/3$, $0$, $1/3$, and $1/2$, respectively. This implies that audiences accustomed to Pearson correlations may be less impressed when presented with the same correlations on the Kendall-Somers scale. A possible remedy for this problem is to use the endpoint transformation (13) on confidence intervals for Somers' D or Kendall's $\tau_a$ to define outlier-resistant confidence intervals for the Pearson correlation.
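Greiner's relation (12), its inverse (13), and the endpoint transformation of a confidence interval can be sketched as follows. The function names and the confidence limits in the example are hypothetical, and the result is a confidence interval for the Pearson correlation only if the bivariate Normal model (or Greiner's relation more generally) is assumed to hold.

```python
from math import asin, pi, sin, sqrt


def kendall_tau_from_pearson(rho):
    """Formula (12): tau = (2 / pi) * arcsin(rho)."""
    return (2.0 / pi) * asin(rho)


def pearson_from_kendall_tau(tau):
    """Formula (13): rho = sin((pi / 2) * tau)."""
    return sin((pi / 2.0) * tau)


def pearson_interval(tau_lower, tau_upper):
    """Transform confidence limits for tau (or D) into limits for rho.

    The transformation is increasing on [-1, 1], so the endpoint order is kept.
    """
    return pearson_from_kendall_tau(tau_lower), pearson_from_kendall_tau(tau_upper)


print(kendall_tau_from_pearson(0.5))
print(kendall_tau_from_pearson(1.0 / sqrt(2.0)))
# Hypothetical confidence limits for Kendall's tau_a, for illustration only.
print(pearson_interval(1.0 / 3.0, 0.5))
```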


This practice of endpoint transformation is also useful if we expect variables $X$ and $Y$ not to have a bivariate Normal distribution themselves, but to be transformed to a pair of bivariate Normal variables by a pair of monotonically increasing transformations $g_X(X)$ and $g_Y(Y)$. As Somers' D and Kendall's $\tau_a$ are rank parameters, they will not be affected by replacing $X$ with $g_X(X)$ and/or replacing $Y$ with $g_Y(Y)$. Therefore, the endpoint transformation method can be used to estimate the Pearson correlation between $g_X(X)$ and $g_Y(Y)$, without even knowing the form of the functions $g_X(\cdot)$ and $g_Y(\cdot)$. This can save a lot of work, if the Pearson correlation between the transformed variables was what we wanted to know.

Greiner's relation, or something very similar, is expected to hold for a lot of other bivariate continuous distributions, apart from the bivariate Normal. Kendall (1949)[4] showed that Greiner's relation is not affected by odd-numbered moments, such as skewness. Newson (1987)[7], using a much simpler line of argument, discussed the case where two variables $X$ and $Y$ are defined as sums or differences of up to 3 latent variables (hidden variables) $U$, $V$ and $W$, which were assumed to be sampled independently from an arbitrary common continuous distribution. It was shown that different definitions of $X$ and $Y$ implied the values of Kendall's $\tau_a$ and Pearson's correlation displayed in Table 1. These pairs of values all lie along the line of Greiner's relation, as displayed in Figure 4.

Table 1: Kendall and Pearson correlations for X and Y defined in terms of independent continuous latent variables U, V and W.

X        Y        Kendall's tau_a    Pearson's rho
U        V        0                  0
V + U    W ± U    ±1/3               ±1/2
U        V ± U    ±1/2               ±1/sqrt(2)
U        ±U       ±1                 ±1
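Two rows of Table 1 can be checked by simulation. The sketch below is not the argument of Newson (1987); it simply draws the latent variables from an exponential distribution (an arbitrary choice made here for illustration, as are the seed and sample size) and computes sample versions of both correlations, which should fall near the tabulated values.

```python
import random
from itertools import combinations
from math import sqrt


def kendall_tau_a(x, y):
    """Mean of sign(xi - xj) * sign(yi - yj) over all pairs i < j."""
    total = 0
    n_pairs = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        total += ((dx > 0) - (dx < 0)) * ((dy > 0) - (dy < 0))
        n_pairs += 1
    return total / n_pairs


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(20090527)  # fixed seed so that the run is repeatable
n = 600
# Assumption for illustration: the common latent distribution is exponential.
u = [rng.expovariate(1.0) for _ in range(n)]
v = [rng.expovariate(1.0) for _ in range(n)]
w = [rng.expovariate(1.0) for _ in range(n)]

v_plus_u = [vi + ui for vi, ui in zip(v, u)]
w_plus_u = [wi + ui for wi, ui in zip(w, u)]

# Row X = V + U, Y = W + U: expect values near 1/3 and 1/2.
tau_row2, r_row2 = kendall_tau_a(v_plus_u, w_plus_u), pearson_r(v_plus_u, w_plus_u)
# Row X = U, Y = V + U: expect values near 1/2 and 1/sqrt(2).
tau_row3, r_row3 = kendall_tau_a(u, v_plus_u), pearson_r(u, v_plus_u)

print("X = V + U, Y = W + U:", tau_row2, r_row2)
print("X = U,     Y = V + U:", tau_row3, r_row3)
```

The exponential distribution is skewed, so this also illustrates the point that the tabulated values do not depend on the latent variables being Normal.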

Acknowledgement

I would like to thank Raymond Boston of Pennsylvania University, PA, USA for raising the issue of interpretations of Somers' D, and for prompting me to summarize these multiple interpretations in a single document.

References

[1] Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 1972; 34(2): 187-220.
[2] Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. Journal of the American Medical Association 1982; 247(18): 2543-2546.
[3] Kendall MG, Gibbons JD. Rank Correlation Methods. 5th Edition. New York, NY: Oxford University Press; 1990.
[4] Kendall MG. Rank and product-moment correlation. Biometrika 1949; 36(1/2): 177-193.
[5] Newson R. Confidence intervals for rank statistics: Somers' D and extensions. The Stata Journal 2006; 6(3): 309-334.
[6] Newson R. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. The Stata Journal 2002; 2(1): 45-64.
[7] Newson RB. An analysis of cinematographic cell division data using U-statistics [DPhil dissertation]. Brighton, UK: Sussex University; 1987.
[8] Somers RH. A new asymmetric measure of association for ordinal variables. American Sociological Review 1962; 27(6): 799-811.


Figure 1: Somers' D and difference between proportions in the two-sample binary model. (Vertical axis: Somers' D; horizontal axis: difference between proportions.)

Figure 2: Somers' D and hazard ratio in the two-sample constant hazard ratio model. (Vertical axis: Somers' D; horizontal axis: hazard ratio.)


Figure 3: Somers' D and mean difference in SDs in the two-sample equal-variance Normal model. (Vertical axis: Somers' D; horizontal axis: mean difference in SDs.)

Figure 4: Somers' D and Pearson correlation in the bivariate Normal model. (Vertical axis: Somers' D; horizontal axis: Pearson correlation coefficient.)
