REVIEW ARTICLE
SUMMARY
Background: In intervention trials, only randomization guarantees equal distributions of all known and unknown patient characteristics between an intervention group and a control group and enables causal statements on treatment effects. However, randomized controlled trials have been criticized for insufficient external validity; non-randomized trials are an alternative here, but come with the danger of intervention and control groups differing with respect to known and/or unknown patient characteristics. Non-randomized trials are generally analyzed with multiple regression models, but the so-called propensity score method is now being increasingly used.
Methods: The authors present, explain, and illustrate the propensity score method, using a study on coronary artery bypass surgery as an illustrative example. This article is based on publications retrieved by a selective literature search and on the authors’ scientific experience.
Results: The propensity score (PS) is defined as the probability that a patient will receive the treatment under investigation. In a first step, the PS is estimated from the available data, e.g. in a logistic regression model. In a second step, the actual treatment effect is estimated with the aid of the PS. Four methods are available for this task: PS matching, inverse probability of treatment weighting (IPTW), stratification by PS, and regression adjustment for the PS.
Conclusion: The propensity score method is a good alternative method for the analysis of non-randomized intervention trials, with epistemological advantages over conventional regression modelling. Nonetheless, the propensity score method can only adjust for known confounding factors that have actually been measured. Equal distributions of unknown confounding factors can be achieved only in randomized controlled trials.
►Cite this as:
Kuss O, Blettner M, Börgermann J: Propensity score: an alternative method of analyzing treatment effects - part 23 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2016; 113: 597–603.
DOI: 10.3238/arztebl.2016.0597
There is consensus in medical research that the primary method for evaluating treatments is the randomized controlled trial. Randomization is the only method that guarantees equal distributions of known and unknown patient characteristics between an intervention group and a control group and enables causal statements on treatment effects. However, randomized controlled trials are in some cases “unnecessary, inappropriate, impossible, or inadequate” (1) and also continue to be criticized for a lack of external validity: patients in randomized controlled trials are usually younger and healthier than the average patient (2, 3).
Non-randomized studies can be an alternative for evaluating treatments. However, they suffer from a lack of internal validity: treatment allocation is not randomized and the intervention and control groups may be systematically different in terms of known and (even worse) unknown patient characteristics. Any differences between groups that arise during a study are therefore not necessarily due to differences in treatment: they may have been caused by the systematic differences between the groups.
A range of statistical procedures have been developed to take account of these differences during analysis. The standard procedures for this are multiple regression models. However, propensity scores are also being used more and more frequently (4). This article introduces propensity scores and describes and explains them in detail, first in general terms and then using an example from coronary bypass surgery. Next, the differences between propensity scores and conventional regression models are stated. The article concludes with a number of essential observations on obtaining knowledge in medical research.
German Diabetes Center, Institute for Biometrics and Epidemiology and Centre for Health and Society (chs), Heinrich-Heine-Universität Düsseldorf: Prof. Dr. sc. hum. Kuss
Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center Mainz: Prof. Dr. rer. nat. Blettner
Department of Cardiothoracic Surgery, Heart and Diabetes Center North Rhine–Westphalia, Ruhr-University Bochum, Bad Oeynhausen: PD Dr. med. Börgermann
Propensity score
The propensity score (PS) is the probability of a patient receiving the treatment being tested. In a 1:1 randomized trial, this is exactly 0.5. In a non-randomized study, this probability is unknown for each individual patient and depends on patient characteristics. The PS must therefore first be estimated from the available data. A logistic regression model can be used for this, with the allocated treatment as the dependent variable and the patient characteristics before treatment as the independent variables. Using the
Deutsches Ärzteblatt International | Dtsch Arztebl Int 2016; 113: 597–603 597
MEDICINE
TABLE 1
Properties of the four different propensity score (PS) methods and of conventional regression analysis in evaluating non-randomized treatment effects
Columns 2 to 5 are the PS methods; column 6 is conventional regression analysis.
Property | PS matching | IPTW estimation | Stratification | Regression adjustment for the PS | Conventional regression analysis
Allows for easy assessment of comparability of treated and untreated patients | + | (+) | (+) | − | −
Allows assessment of balance of characteristics in the data | + | + | (+) | − | −
Uses complete dataset (smaller variance of the treatment effect, greater danger of bias) | − | + | + | + | +
Similar to an RCT (generates comparable groups, ignores outcomes) | + | (+) | (+) | − | −
Robust against outliers (patients with extreme propensity scores) | + | − | + | + | +
Fewer statistical assumptions in the model | + | + | (+) | − | −
RCT, randomized controlled trial; IPTW, inverse probability of treatment weighting; PS, propensity score
“+” stands for “yes”; “−” stands for “no”; “(+)” stands for “partially given”
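The first step of a PS analysis, estimating each patient's PS from a logistic regression of the allocated treatment on pre-treatment characteristics, can be sketched as follows. This is a minimal illustration in plain Python and not the authors' analysis: the single covariate `age_std`, the simulated data, and the helper `fit_logistic` are hypothetical, and a real analysis would include many characteristics and would normally rely on an established statistics package.

```python
import math
import random

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, t, n_iter=25):
    """Fit P(T=1 | x) = sigmoid(b0 + b1 * x) by Newton-Raphson (maximum likelihood)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += ti - p            # score for the intercept
            g1 += (ti - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (X'WX)^-1 times the score vector
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical non-randomized data: patients with higher age_std are treated more often.
random.seed(1)
age_std = [random.gauss(0.0, 1.0) for _ in range(500)]
treated = [1 if random.random() < sigmoid(-0.3 + 0.8 * a) else 0 for a in age_std]

b0, b1 = fit_logistic(age_std, treated)
ps = [sigmoid(b0 + b1 * a) for a in age_std]  # estimated PS for each patient
print(f"intercept={b0:.2f}, slope={b1:.2f}, mean PS={sum(ps) / len(ps):.2f}")
```

Because the model contains an intercept, the mean estimated PS equals the observed proportion of treated patients, which is a quick sanity check on the fit.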
estimated parameters of this PS model, the propensity score can then be calculated for each individual patient.
When selecting independent variables for the PS model, care must be taken to use characteristics that predict subsequent treatment success (rather than treatment allocation), as these reduce the variance of the treatment effect without giving rise to any additional bias (5). Naturally, the PS model cannot take into account factors that are unknown or were not measured.
The second step is to use the propensity score to estimate the treatment effect of interest. There are four methods for using the propensity score (6):
● PS matching
● Inverse probability of treatment weighting (IPTW) estimation
● Stratification
● Regression adjustment for the PS.
PS matching: In PS matching, each treated patient is allocated one untreated patient (in 1:1 matching), or more than one untreated patient (in 1:n matching), with the same PS or with a PS that differs only slightly, within previously defined limits. The treatment effect is then estimated in the matched population, while accounting for the matching process in the statistical analysis (7).
IPTW estimation: In IPTW estimation, each patient is allocated the reciprocal of the treatment probability associated with his/her actual treatment as a statistical weight: a treated patient receives the weight 1/PS, and an untreated patient the weight 1/(1−PS). There are mathematical reasons for this definition of weights, but it can also be interpreted intuitively (8): A treated patient with a low PS (for the treatment) receives a high weighting because he/she is similar to an untreated patient in terms of his/her characteristics (expressed as his/her low PS), so a valid comparison can be made between the two. For the evaluation of the treatment effect, patients enter the statistical analysis according to their weight.
Stratification: PS stratification is a coarsened form of PS matching. Here, the total dataset is divided into several subsets of equal size (e.g. quintiles) on the basis of the estimated PS. In each subset, the treatment effect is estimated using conventional methods, and the treatment effects obtained in this way are then summarized by meta-analytic methods.
Regression adjustment for the PS: In regression adjustment for the PS, a conventional regression model is estimated using the outcome of interest as the dependent variable and treatment and PS as independent variables. The effect of the treatment on the outcome is thus adjusted for the PS, and thereby for all patient characteristics included in the PS.
Each of these methods has specific strengths and weaknesses, but PS matching is generally described as the preferred procedure (9, 10). The main advantage of PS matching is the ability to display the recorded characteristics of treated and untreated patients explicitly, similarly to a Table 1 in a randomized controlled trial. This enables assessment of whether the distribution of these characteristics is similar in treated and untreated patients. In addition, the distribution of patient characteristics before PS matching should be shown, in order to make clear the extent to which PS matching has compensated for differences that were originally present.
Inevitably, PS matching excludes patients for whom no matching partner can be found, while all other PS
FIGURE
Schematic representation of the four propensity score (PS) methods, a conventional regression analysis, and a randomized controlled trial (pictograms not reproduced).
Panels: PS 1: PS model, PS matching, analysis. PS 2: PS model, IPTW weighting, weighted analysis. PS 3: PS model, stratification, stratified analysis. PS 4: PS model, analysis of the outcome in a regression model with treatment and PS as independent variables. Regression model: direct analysis of the outcome in the population of treated and untreated patients. RCT: “true” PS, randomization, analysis. Legend: untreated, intervention, control.
The abbreviations “PS 1” to “PS 4” stand for the four methods of propensity score (PS) analysis:
PS 1 = PS matching
PS 2 = Inverse probability of treatment weighting (IPTW) estimation
PS 3 = Stratification
PS 4 = Regression adjustment for the PS
At the beginning of every PS analysis there is a group of patients who have been treated either with the intervention of interest (red) or with a control intervention (blue). The available patient characteristics are used to estimate a PS model, and each patient’s propensity score is calculated (shown as numerical values on the pictograms in the Figure). Depending on the PS method used, patients are then either matched (PS 1: patients for whom no matching partner has been found are usually excluded; these are labeled with an X), weighted according to their PS (PS 2: patients with a higher IPTW are larger in the Figure), stratified (PS 3: here in tertiles), or included in a regression model with the PS as an independent variable (PS 4). Clinical outcomes are analyzed with respect to the chosen PS method. (For simplicity, the Figure shows cured patients in a cheering pose.)
In contrast, in a conventional regression model a single statistical model is calculated. The clinical outcome is the dependent variable of the model, while treatment and other patient characteristics are independent variables.
The bottom section of the Figure illustrates the similarity between a randomized controlled trial (RCT) and a PS analysis: initially, patients in an RCT have not yet been treated (gray). Their PS (i.e. the probability of undergoing the intervention) is known: it is 0.5. On randomization, each patient is allocated to receive a treatment, so, as with PS, one group of treated patients and one group of control patients is formed. Finally, clinical outcomes are analyzed.
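The weighting step of IPTW estimation (weight 1/PS for a treated patient, 1/(1−PS) for an untreated patient) can be sketched as follows. The patient list below is hypothetical: the PS values merely echo the kind of numbers shown in the Figure, while the treatment labels, the outcomes, and the helper names `iptw_weight` and `weighted_mean` are assumptions made for illustration only.

```python
def iptw_weight(treated, ps):
    """Reciprocal of the probability of the treatment actually received."""
    if not 0.0 < ps < 1.0:
        raise ValueError("PS must lie strictly between 0 and 1")
    return 1.0 / ps if treated else 1.0 / (1.0 - ps)

def weighted_mean(values, weights):
    """Mean of values in which each patient counts according to his/her weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical patients: (treated?, estimated PS, outcome with 1 = event, 0 = no event)
patients = [
    (1, 0.95, 0), (1, 0.89, 0), (1, 0.61, 1), (1, 0.21, 0),
    (0, 0.89, 1), (0, 0.61, 0), (0, 0.49, 1), (0, 0.21, 0),
]

weights = [iptw_weight(t, ps) for t, ps, _ in patients]

# Weighted event risk in each group, then the weighted risk difference.
risk_treated = weighted_mean(
    [y for t, _, y in patients if t == 1],
    [w for (t, _, _), w in zip(patients, weights) if t == 1],
)
risk_control = weighted_mean(
    [y for t, _, y in patients if t == 0],
    [w for (t, _, _), w in zip(patients, weights) if t == 0],
)

print([round(w, 2) for w in weights])
print("weighted risk difference:", round(risk_treated - risk_control, 3))
```

In this sketch the treated patient with the lowest PS and the untreated patient with the highest PS receive the largest weights in their groups. Such patients can dominate the estimate, which may be one reason why Table 1 rates IPTW as not robust against extreme propensity scores.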
TABLE 2
Preoperative patient characteristics before and after PS matching* (modified according to [16])
*Mean ± standard deviation is given for continuous patient characteristics. Relative frequency as a percentage is given for categorical patient characteristics.
BMI, body mass index; cCABG, conventional CABG; CABG, coronary artery bypass grafting; COPD, chronic obstructive pulmonary disease; IABP, intra-aortic balloon pump; clampless OPCAB,
clampless off-pump coronary artery bypass grafting; LVEF, left-ventricular ejection fraction; PAOD, peripheral arterial occlusive disease; PS, propensity score
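A before/after comparison of the kind summarized in Table 2 can be sketched with 1:1 matching on the PS followed by a simple balance measure. Everything in this sketch is an assumption made for illustration: the data are simulated, the PS is taken as known rather than estimated, the matching is a simple greedy nearest-neighbour procedure, the caliper of 0.05 is arbitrary, and balance is measured by the standardized difference in means, which is only one of several measures that have been proposed.

```python
import math
import random
import statistics

def std_diff(treated_vals, control_vals):
    """Standardized difference in means (difference divided by the pooled SD)."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated_vals) + statistics.variance(control_vals)) / 2.0
    )
    return (statistics.mean(treated_vals) - statistics.mean(control_vals)) / pooled_sd

def match_1_to_1(ps_treated, ps_control, caliper):
    """Greedy 1:1 nearest-neighbour matching on the PS, without replacement."""
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        j = min(available, key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))      # treated patient i is matched to control j
            available.remove(j)       # each control is used at most once
    return pairs                      # unmatched treated patients are excluded

# Hypothetical data: one characteristic x that drives treatment allocation.
random.seed(7)
x = [random.gauss(0.0, 1.0) for _ in range(600)]
ps = [1.0 / (1.0 + math.exp(-xi)) for xi in x]   # PS assumed known here
t = [1 if random.random() < p else 0 for p in ps]

x_t = [xi for xi, ti in zip(x, t) if ti == 1]
x_c = [xi for xi, ti in zip(x, t) if ti == 0]
ps_t = [p for p, ti in zip(ps, t) if ti == 1]
ps_c = [p for p, ti in zip(ps, t) if ti == 0]

pairs = match_1_to_1(ps_t, ps_c, caliper=0.05)
before = std_diff(x_t, x_c)
after = std_diff([x_t[i] for i, _ in pairs], [x_c[j] for _, j in pairs])
print(f"matched pairs: {len(pairs)}, std. diff. before: {before:.2f}, after: {after:.2f}")
```

The matched sample is smaller than the original one, since treated patients without a control inside the caliper are dropped. In exchange, the characteristic is distributed more similarly in the two groups.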
methods use the entire dataset for analysis. This can result in lower case numbers and so less statistical power for PS matching, but it does have the advantage that looking at excluded patients makes clear which patients were overrepresented or underrepresented in the treatment group. As a result, however, no statements can subsequently be made about these subgroups.
Finally, the choice between PS matching and the other PS methods always involves a trade-off between a biased and an imprecise estimate of treatment effect (8). PS matching should be used when the groups need to be as similar as possible (thus minimizing bias). However, because case numbers will then be smaller, a greater variance of the estimated treatment effect must be accepted. Table 1 gives an overview of the strengths and weaknesses of the various methods. The Figure provides a schematic representation of the four PS methods versus those for a randomized controlled trial and a conventional regression analysis.
The quality of a PS model should only be judged on the basis of how well patient characteristics are balanced between the two treatment groups. Neither goodness-of-fit tests such as the Hosmer–Lemeshow test (11) nor discrimination measures such as the c-statistic (12) are suitable for this. Both these procedures are inappropriate for revealing unknown confounding factors (13). Worse still, a high c-statistic is neither necessary nor sufficient for good adjustment for confounding factors. This can be illustrated by the example of a randomized controlled trial, the design of which by definition achieves a very good balance between confounding factors, but which will have a very small c-statistic (approx. 0.5) (14). Many measures have been proposed specifically to measure balance between patient characteristics (6, 15).
Further methodological development of the propensity score method continues. Unfortunately, it is impossible to examine other important aspects (e.g. dealing with missing values, minimum requirements for sample sizes, software, influence of various matching algorithms) in more detail here.
An example
In the following we report on a published PS analysis on coronary bypass surgery (16) which was performed by the first and last author of this article together. It was based on a dataset from a total of 1282 patients who underwent isolated heart surgery at the Herz- und Diabeteszentrum NRW, Bad Oeynhausen between July 2009 and November 2010. Of these patients, 69.2%
TABLE 3
Results for the three clinical outcomes in the PS-matched patient group (n = 788) (modified according to [16])
Outcome | Clampless OPCAB (n = 394) | cCABG (n = 394) | Effect measure [95% CI] | p-value
Binary outcome: postoperative death or stroke [n (%)] | 6 (1.5) | 22 (5.6) | Odds ratio 0.27 [0.11; 0.67] | 0.005
Continuous outcome: operative time in minutes [mean (SD)] | 175 (38) | 180 (47) | MD 5 [−1; 11] | 0.12
Time-to-event outcome: time to death or stroke in follow-up (probability of neither event at one year in %) | 94.7 | 89.8 | Hazard ratio 0.60 [0.35; 1.03] | 0.06
cCABG, conventional coronary artery bypass grafting; CI, confidence interval; clampless OPCAB, clampless off-pump coronary artery bypass grafting; MD, mean difference; PS, propensity score; SD, standard deviation
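As a plausibility check on the binary outcome in Table 3, a crude odds ratio can be computed directly from the reported counts (6 of 394 versus 22 of 394). The sketch below uses the standard log-based (Woolf) confidence interval. It treats the two groups as independent samples, which is an assumption of this illustration; the helper name `odds_ratio_with_ci` is hypothetical.

```python
import math

def odds_ratio_with_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Crude odds ratio (group A vs. group B) with a log-based (Woolf) 95% CI."""
    a, b = events_a, n_a - events_a   # events / non-events in group A
    c, d = events_b, n_b - events_b   # events / non-events in group B
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

# Counts from Table 3: postoperative death or stroke, clampless OPCAB vs. cCABG.
or_crude, ci_low, ci_high = odds_ratio_with_ci(6, 394, 22, 394)
print(f"crude OR = {or_crude:.2f} [{ci_low:.2f}; {ci_high:.2f}]")
```

The crude result, roughly 0.26 [0.10; 0.65], is close to the published 0.27 [0.11; 0.67]. The two need not agree exactly, because the published estimate presumably comes from an analysis that accounts for the matching.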
KEY MESSAGES
● Propensity scores are increasingly being used to analyze non-randomized studies.
● The randomized controlled trial remains the study design of choice for testing treatment efficacy. However, it is important to ensure that this knowledge does not ossify into dogma in clinical research.
● Propensity scores cannot replace randomization but are a good alternative for analyzing non-randomized trials.
● Like conventional regression models, propensity scores can only adjust for patient characteristics that are known and have actually been measured. Only randomized controlled trials can achieve equal distribution of unknown confounding variables too.
● Demand from patients, clinicians, and the health-care system for evidence from non-randomized studies will continue to increase in the next few years.
12. Harrell FE: Regression modeling strategies. New York: Springer 2001; 257.
13. Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor VM: Weaknesses of goodness-of-fit tests for evaluating propensity score models: the case of the omitted confounder. Pharmacoepidemiol Drug Saf 2005; 14: 227–38.
14. Westreich D, Cole SR, Funk MJ, Brookhart MA, Stürmer T: The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiol Drug Saf 2011; 20: 317–20.
15. Belitser SV, Martens EP, Pestman WR, Groenwold RH, de Boer A, Klungel OH: Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20: 1115–29.
16. Börgermann J, Hakim K, Renner A, et al.: Clampless off-pump versus conventional coronary artery revascularization: a propensity score analysis of 788 patients. Circulation 2012; 126 (11 Suppl 1): S176–82.
17. Austin PC: Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat 2011; 10: 150–61.
18. Kuss O: The z-difference can be used to measure covariate balance in matched propensity score analyses. J Clin Epidemiol 2013; 66: 1302–7.
19. Rubin DB, Thomas N: Matching using estimated propensity scores: relating theory to practice. Biometrics 1996; 52: 249–64.
20. Hedderich J, Sachs L: Angewandte Statistik. Berlin, Heidelberg, New York: Springer 2016; 264.
21. Rubin DB: The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med 2007; 26: 20–36.
22. Martens EP, de Boer A, Pestman WR, Belitser SV, Stricker BH, Klungel OH: Comparing treatment effects after adjustment with multivariable cox proportional hazards regression and propensity score methods. Pharmacoepidemiol Drug Saf 2008; 17: 1–8.
23. Pattanayak CW, Rubin DB, Zell ER: [Propensity score methods for creating covariate balance in observational studies]. Rev Esp Cardiol 2011; 64: 897–903.
24. Cepeda MS, Boston R, Farrar JT, Strom BL: Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003; 158: 280–7.
25. Braitman LE, Rosenbaum PR: Rare outcomes, common treatments: analytic strategies using propensity scores. Ann Intern Med 2002; 137: 693–5.
26. Anglemyer A, Horvath HT, Bero L: Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials. Cochrane Database Syst Rev 2014; 4: MR000034.
27. Eichler M, Pokora R, Schwentner L, Blettner M: Evidenzbasierte Medizin – Möglichkeiten und Grenzen. Dtsch Arztebl 2015; 112: A 2190–2.
28. Manson JE, Hsia J, Johnson KC, et al., Women’s Health Initiative Investigators: Estrogen plus progestin and the risk of coronary heart disease. N Engl J Med 2003; 349: 523–34.
29. Abel U, Koch A: The role of randomization in clinical studies: myths and beliefs. J Clin Epidemiol 1999; 52: 487–97.
30. Hernán MA, Alonso A, Logan R: Observational studies analyzed like randomized experiments: an application to postmenopausal hormone therapy and coronary heart disease. Epidemiology 2008; 19: 766–79.
31. Stuart EA: Matching methods for causal inference: a review and a look forward. Statistical Science 2010; 25: 1–21.
32. Borah BJ, Moriarty JP, Crown WH, Doshi JA: Applications of propensity score methods in observational comparative effectiveness and safety research: where have we come and where should we go? J Comp Eff Res 2014; 3: 63–78.
Corresponding author:
Prof. Dr. sc. hum Oliver Kuß
German Diabetes Center (DDZ)
Leibniz Center for Diabetes Research, Heinrich Heine University
Institute for Biometrics and Epidemiology
Auf’m Hennekamp 65
40225 Düsseldorf, Germany
oliver.kuss@ddz.uni-duesseldorf.de