
Periodontology 2000, Vol. 59, 2012, 75–88

© 2012 John Wiley & Sons A/S

Key statistical and analytical issues for evaluating treatment effects in periodontal research
Yu-Kang Tu & Mark S. Gilthorpe
It is difficult to find any publication in mainstream periodontal journals that has not used some type of statistical analysis, ranging from simple descriptive statistics, such as the mean and standard deviation, to more sophisticated statistical modeling, such as generalized linear models. Nevertheless, many periodontal researchers have mixed feelings towards statistics. They recognize that statistics and statistical methods are very important and useful tools for quantitative clinical research, yet they also find statistics a rather difficult subject to comprehend. Whilst they are always instructed to seek advice from professional statisticians, it is not easy to find statisticians who are both available to collaborate and able to communicate with clinicians in an understandable language. Whilst most researchers have attended some courses in basic statistics during their early education or research training, these courses are not designed to provide sufficient knowledge and skills for undertaking complex statistical analysis. Furthermore, there is always a knowledge gap between simple textbook examples and real research practice. In our experience, most researchers not only find the mathematical equations of statistical tests hard to digest, they also feel uncertain of how to choose the most appropriate statistical test to analyze their data and how to interpret the results appropriately.

With the advance of personal computers and user-friendly statistical software packages, there is an increasing temptation for researchers to analyze their data using sophisticated statistical methods without consultation with experienced biostatisticians. In the last few decades, many medical statisticians have warned that the standard of statistical analyses in clinical research needs to be improved and have urged greater collaboration between clinicians and statisticians (2–5, 33). Whilst we totally agree with these observations and the authors' suggestions, we also recognize that for many dental and periodontal researchers, statisticians are not always around to help, and that, with proper training, nonstatisticians are able to undertake some simple analyses.

The aim of this article is to provide an overview of several important basic statistical issues related to the evaluation of treatment effects and to clarify some statistical misconceptions for periodontal researchers. Some of these issues are general, concerning many disciplines, and some are unique to periodontal research. In the following sections, we first discuss several statistical concepts that have sometimes been overlooked or misunderstood by periodontal researchers. Then we go on to explain a few more advanced methodological issues that perplex many researchers. In the final section, we provide some personal reflections on current statistical practice in periodontal research and make a few recommendations for improvement. As the aim of this article is to help readers grasp statistical concepts, we have tried to minimize the use of mathematical equations. It is not feasible, and is also unnecessary, to give mathematical details in this article because those details can be readily found in good introductory textbooks. Our intention is to encourage the periodontal research community to take greater care and consideration in the statistical analyses they might use in the future.

Statistical issues
Mean vs. median
Almost every introductory textbook of medical statistics starts with an introduction of what the mean is and what the median is (11, 22, 44). The mean is the sum of the values of n observations divided by n, and the median is the midway value: half of the observations are above the median and half are below it. This is not always true for the mean, however, unless the distribution of values is symmetrical (for example, in the normal distribution, which has a shape that resembles a bell). If n is an even number, such as 10, the median is the average of the two values in the middle (such as the fifth and sixth values when n = 10). Most periodontal researchers know these definitions, but many are confused regarding whether to use the mean or the median to summarize a variable. It is usually recommended that when a variable follows a normal distribution, the mean and standard deviation should be reported; when a variable does not follow a normal distribution, the median and (interquartile) range should be reported instead (the interquartile range is given by those values that lie one quarter and three quarters along the list of values when ranked in order; again, averages of neighboring values may be needed depending upon the size of n). When a variable follows a normal distribution, the mean should be close to the median. When the distribution is skewed (i.e. not symmetric), the median can be much smaller or larger than the mean.

Many variables collected in periodontal research do not follow a normal distribution; for example, probing pocket depth in patients tends to be positively skewed, with a long tail to the right, because pocket depth can be up to 9 or 10 mm and cannot take negative values. Does this mean that it is wrong to report the mean as a summary of pocket depth and that the median should be reported instead? From a practical point of view, it is not necessarily wrong to use the mean to summarize a skewed variable, as the mean is just a mathematical expression. The mean and median contain different information, and whether to report the mean or the median depends on which information is more important in the research context. For example, suppose a patient has 20 teeth in their mouth, and pocket depth is measured at six sites around each tooth, totaling 120 sites. If, say, 100 sites have a pocket depth of just 1 mm and the other 20 sites have a pocket depth of 8 mm, the median is 1 mm and the interquartile range is 1 to 1 mm, suggesting extremely good periodontal health, although the range of 1–8 mm may indicate some isolated periodontal problems. In contrast, the mean is 2.2 mm with a standard deviation of 2.6 mm, indicating greater variation in periodontal health. This hypothetical example shows that neither the mean nor the median is a perfect summary of the information contained in the data, and researchers should use their clinical knowledge to decide which provides the more important information.
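A minimal sketch of this worked example, assuming only the standard Python scientific stack (numpy), reproduces the summary statistics quoted above; the 100-site/20-site split is the hypothetical mouth described in the text.

```python
import numpy as np

# Hypothetical mouth from the text: 100 sites at 1 mm and 20 sites at 8 mm.
depths = np.array([1.0] * 100 + [8.0] * 20)

mean = depths.mean()                      # ~2.2 mm
sd = depths.std(ddof=1)                   # ~2.6 mm (sample SD)
median = np.median(depths)                # 1 mm
q1, q3 = np.percentile(depths, [25, 75])  # interquartile range: 1 to 1 mm

print(f"mean = {mean:.1f} mm, SD = {sd:.1f} mm")
print(f"median = {median:.1f} mm, IQR = {q1:.1f}-{q3:.1f} mm")
```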

Standard deviation vs. standard error of the mean


One common practice in the periodontal literature is to report the mean and the standard error of the mean (SEM) to summarize the location and spread of a variable, whilst a better practice is to report the mean and the standard deviation (SD) (6). This error probably arises because the SEM and the SD are mathematically related: SEM = SD/√n; hence, many researchers may think that they are exchangeable. One caveat of using the SEM for summarizing variables is that its values are much smaller than those of the SD when the sample size is large, giving a misleading impression that the spread of observations is small. However, their interpretations are very different: the SD is a simple description of the spread of the observations, whilst the SEM is a statistical inference about the location of the mean. The SD provides information about the distribution of the data observed (i.e. the sample), whilst the SEM is an inference about the possible range of the mean of the population from which our sample is taken. When a larger sample is taken from the same population, its SD remains similar to that of the smaller sample; however, its SEM becomes smaller, because a larger sample contains more reliable information about where the population mean is likely to be.
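The point about sample size is easy to demonstrate by simulation; the sketch below (hypothetical normal data, numpy only) draws a small and a large sample from the same population and compares the SD with SEM = SD/√n.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # Hypothetical probing depths ~ Normal(mean 3 mm, SD 2 mm), truncated at 0 mm.
    return np.clip(rng.normal(loc=3.0, scale=2.0, size=n), 0, None)

for n in (25, 2500):
    x = sample(n)
    sd = x.std(ddof=1)
    sem = sd / np.sqrt(n)
    print(f"n = {n:5d}: SD = {sd:.2f} mm (stable), SEM = {sem:.3f} mm (shrinks with n)")
```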

Confidence intervals vs. P-values


Most statisticians prefer reporting confidence intervals to P-values, and this is not without good reason. Many researchers crave small P-values (e.g. P < 0.05 or P < 0.001) in the output generated by statistical software and often ignore all other information. For many researchers, the only reason for conducting a statistical analysis is to find the P-value, and only those results with small P-values are considered worthy of writing up and publishing. Whilst P-values are still rampant in many papers in medical and dental journals, their use has been heavily criticized in psychological and epidemiological journals (36, 41, 43, 54, 63, 64, 68). As explained later in this article, P-values are related to both effect size and sample size, and the interpretation of effect size is more important than P-values. The P-value is the probability of observing an effect size at least as large as the one observed, given that the null hypothesis is true (which usually means that the effect size is null). If this probability is small, for example <0.05, it is concluded that the null hypothesis is false and that the effect size is more likely to be genuine. (Note, however, that if the null hypothesis is false, the interpretation of the P-value, which is conditional upon the null hypothesis being true, is erroneous!) As it is more important to know how much better one treatment works than another, or how large the difference is between two groups in their mean outcomes, the focus of interpretation should be the effect size and its possible range, in other words the confidence interval.

Furthermore, the usual cut-off probability of 0.05 for rejecting the null hypothesis is arbitrary, and it has unfortunately caused the division of results into nonsignificance and significance. A common misconception is that if the P-value is very small, the treatment effect is true, and if the P-value is large, there is no treatment effect. For example, suppose one study finds that the difference in the reduction of plaque scores between an electric toothbrush A and a manual toothbrush B is 10%, with a P-value of 0.01, and concludes that A is better than B because the P-value is <0.05. Another study finds that the difference in the reduction of plaque scores between an electric toothbrush C and the manual toothbrush B is also 10%, with a P-value of 0.1, and concludes that there is no difference between C and B. The differences in effect size (i.e. the reduction in plaque scores) between the two treatments in these two studies are the same; however, one has a larger P-value because it has wider confidence intervals (in other words, there is greater uncertainty in the estimation of the effect size). This may be the result of a smaller sample size, more heterogeneous responses to the treatments, or both. Similarly, a small effect size without practical significance can be statistically significant, whilst a large effect size can be statistically nonsignificant. The P-value does not provide the same quality of information as a confidence interval and should only be used where a confidence interval is not feasible, as when conducting a chi-square test, for instance. Even so, one has to be sure that the test is pertinent (i.e. there is an a priori research question to be addressed), as indiscriminate use of statistical testing, and hence of P-values, is poor practice (36, 41, 43, 54, 63, 64, 68).
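To make the toothbrush example concrete, the sketch below (scipy assumed) computes the P-value and 95% confidence interval for each study from a normal approximation. Only the 10% difference comes from the text; the standard errors are invented purely so that the two studies land near P = 0.01 and P = 0.1.

```python
from scipy import stats

def summarize(diff, se, label):
    """Two-sided P-value and 95% CI for a difference, normal approximation."""
    z = diff / se
    p = 2 * stats.norm.sf(abs(z))
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"{label}: diff = {diff}%, P = {p:.2f}, 95% CI = ({lo:.1f}%, {hi:.1f}%)")

# Same 10% difference in plaque-score reduction, different precision (assumed SEs).
summarize(10.0, se=3.9, label="A vs. B (more precise study)")   # P ~ 0.01, CI excludes 0
summarize(10.0, se=6.1, label="C vs. B (less precise study)")   # P ~ 0.10, CI includes 0
```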

Statistical power and sample size calculation


Prior statistical power and sample size calculation has become the norm in clinical trial design, and how to conduct sample size calculations for simple studies can be found in many textbooks (11, 22, 44). There are two main reasons for this requirement: first, it helps researchers to estimate the cost and time required for the study; and, second, the interpretation of the final results is more likely to be clinically relevant and informative. A study without a sufficient number of patients may give rise to inconclusive results, such as those discussed in the previous section on P-values, where there is a difference in the treatment effects but the confidence intervals are wide. In contrast, a study with too many patients wastes precious resources, and a very small, potentially clinically insignificant, difference in treatment effects can be statistically significant.

Sample size calculation requires prior knowledge of the potential effect size, the variation of the effect size and the required statistical power. For example, suppose a clinical trial is designed to test whether the use of enamel matrix derivatives in conjunction with bone grafts is better than enamel matrix derivatives alone in the treatment of infrabony lesions. The primary outcome is change in clinical attachment level. According to a previous study, the estimated difference in clinical attachment level gain is 0.5 mm and the standard deviation is 1.8 mm. Two issues need to be considered here. One is the type-I error (false-positive) rate, i.e. how likely it is that the trial shows a 0.5-mm difference in clinical attachment level gain between the two treatments when in fact their treatment effects are the same; this is usually set at 0.05 (i.e. the threshold of the P-value for statistical significance). The second is the type-II error (false-negative) rate, i.e. the probability that the trial fails to demonstrate the difference between the two treatments when, in fact, one is better than the other. The type-II error rate is usually set at 0.2, and 1 minus the type-II error rate is known as the statistical power, often expressed as a percentage (i.e. 80% power in this instance). Clearly, an ideal clinical trial should have large statistical power (>80%) and a small type-I error rate (<0.05). Results given by the statistical software package STATA (version 11.1; StataCorp, College Station, TX, USA) show that the required sample size for the hypothetical trial is 204 patients in each treatment group, in other words 408 patients in total, with statistical power equal to 80%. If we want greater statistical power, such as 90%, each treatment group needs 273 patients. As some patients may withdraw from the trial, an extra 10% or so of patients is usually recruited to account for this. Nevertheless, it should be noted that even when sufficient numbers of patients are recruited according to the prior sample-size calculation, this does not guarantee that a difference in treatment effects will be found or that the results will be statistically significant, because the difference in treatment effects may be smaller than expected, or the variation in treatment effects may be greater than expected, or both. For example, the results of periodontal treatments undertaken by inexperienced clinicians may show greater variation than those carried out by an expert. The success of a clinical study depends upon many factors, of which sample size calculation is only one.
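The same calculation can be reproduced outside STATA; a minimal sketch with statsmodels' two-sample t-test power routine (the 0.5 mm difference and 1.8 mm SD are the figures quoted above) gives group sizes within a patient or two of the 204 and 273 quoted in the text, the small discrepancy arising because STATA's default formula uses a normal approximation.

```python
from math import ceil

from statsmodels.stats.power import TTestIndPower

diff_mm, sd_mm = 0.5, 1.8          # expected difference and SD of CAL gain (from the text)
effect_size = diff_mm / sd_mm      # standardized difference (Cohen's d)

analysis = TTestIndPower()
for power in (0.80, 0.90):
    n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                       power=power, ratio=1.0,
                                       alternative="two-sided")
    print(f"power = {power:.0%}: about {ceil(n_per_group)} patients per group")
```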

Clinical vs. statistical significance


In the previous example given for sample size calculation, the difference in treatment effects was set at 0.5 mm between the use of enamel matrix derivatives in conjunction with bone grafts and the use of enamel matrix derivatives alone. However, even if the trial eventually shows a statistically significant 0.5-mm difference in clinical attachment level gain, it is debatable whether such a small difference is clinically relevant (41, 43, 63, 64, 68). First, clinical attachment level gain is a surrogate end point, which may not be related to true end points, such as long-term tooth survival. Second, the additional cost and time required to achieve an average 0.5 mm of extra clinical attachment level gain should also be considered. Third, any additional risks and complications associated with the treatment need to be taken into account when making clinical decisions. When the sample size is sufficiently large, even a tiny difference can be shown to be statistically significant.

t-test vs. Mann–Whitney U-test (a parametric test vs. a nonparametric test)


Both the t-test and the Mann–Whitney U-test are carried out to compare the difference between two independent groups. The t-test is a parametric test: it assumes that samples are taken independently from a normal distribution, and that the difference between the two groups follows the t-distribution, which is defined mathematically with just one parameter (the degrees of freedom) (58). In contrast, the Mann–Whitney U-test does not make any specific distributional assumption for the sample, so it is known as a nonparametric test. Many periodontal studies use the Mann–Whitney U-test because their outcome variables do not follow a normal distribution. This argument requires some consideration. First, although the sample distribution does not look like a normal distribution, this does not always mean that the population from which the sample is taken does not follow a normal distribution. Second, variables such as clinical attachment level gain that are not normally distributed (as explained previously, periodontal outcomes tend to be positively skewed) do not necessarily preclude the use of parametric tests (16, 20, 53). The t-test is used to test the difference in the means of the two treatment groups, and the mean of a variable with a skewed distribution may itself follow a normal distribution. Suppose we take 100 measurements of clinical attachment level from each of 100 patients. Although each sample of 100 clinical attachment level measurements does not follow the normal distribution, the 100 mean clinical attachment level values from the samples are likely to be normally distributed. This is known as the central limit theorem (20).

Furthermore, whilst the Mann–Whitney U-test does not assume any specific underlying distribution, it does assume that the two groups follow a similar distribution, where the aim is to test the difference in the central locations of the distributions (e.g. the means or the medians) (20, 38). In fact, when the two distributions have the same shape, if their medians are different, their means will also be different. However, if the two distributions do not have the same shape, the Mann–Whitney U-test may detect the difference in shape rather than the difference in central location (18, 38). Figure 1 shows two variables, A and B, with 101 observations each. Although the medians of both A and B are 1, the Mann–Whitney U-test is highly significant (P < 0.001). In this example, the Mann–Whitney U-test detects the difference in the shapes (i.e. the distributions) of the two samples rather than in their central locations (i.e. medians) (20). This shows that only where there is an additional assumption that the two samples have a similar shape can we say that the test is a test of the difference in medians. This has been frequently overlooked in periodontal research (20, 38). Whilst there are no distributional assumptions behind nonparametric tests, it has been argued that routinely applying nonparametric tests is not justified because: (i) the t-test is quite robust to violation of its normal distribution assumption; (ii) a nonnormal distribution can often be transformed to be approximately normal; and (iii) the results from parametric tests are more clinically meaningful, as it is difficult to calculate confidence intervals for nonparametric tests, and when confidence intervals are available for these types of test, their clinical interpretation is not straightforward (16, 20).
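The Figure 1 data are not reproduced here, but a constructed example in the same spirit (two hypothetical samples of 101 observations each, identical medians, very different shapes; scipy assumed) shows how the Mann–Whitney U-test can be highly significant even when the medians coincide.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Two hypothetical samples of 101 observations, both with median = 1.
a = np.array([0.0] * 50 + [1.0] + [2.0] * 50)    # narrow, symmetric around 1
b = np.array([0.9] * 50 + [1.0] + [10.0] * 50)   # same median, long right tail

print("medians:", np.median(a), np.median(b))     # both 1.0
u, p = mannwhitneyu(a, b, alternative="two-sided")
print(f"Mann-Whitney U = {u:.0f}, P = {p:.2g}")   # P << 0.001: the test reacts to shape
```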


Fig. 1. Distributions of observed values of two samples that have the same median but different shapes (sample size = 101).

t-test vs. analysis of covariance


Both the t-test and analysis of covariance (ANCOVA) can be used to analyze data from clinical trials with a pretest–posttest design (i.e. outcomes are measured at baseline and again after the intervention). Both tests aim to detect the difference in changes from baseline between two groups. Some statisticians strongly recommend the use of ANCOVA because, in general, ANCOVA yields greater statistical power and minimizes bias (62, 82). This is because the adjustment for baseline values in ANCOVA removes some background noise in the data and therefore reduces the variation in treatment effects. However, this argument is only valid for studies in which patients are randomly assigned to the two groups (10, 70). When patients come from naturally occurring groups (e.g. men vs. women, or smokers vs. nonsmokers, or patients receiving different treatments based on clinicians' personal judgments), it may not be appropriate to use ANCOVA to analyze such observational data because of the risk of potential over-adjustment (47, 51, 83). This is because adjustment for baseline variables may not be appropriate where the difference in baseline values between treatment groups cannot be explained by chance alone. For example, suppose a study aims to examine whether deep infrabony defects respond to periodontal regenerative therapy better than shallow defects in terms of radiographic defect fill. As defect depth is a natural characteristic of infrabony lesions, which cannot be assigned randomly, adjustment for baseline defect depth would be questionable.
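A minimal sketch of the two analyses on a simulated randomized trial (all numbers invented; statsmodels and scipy assumed) illustrates the point made above: the ANCOVA estimate of the group effect typically has a smaller standard error than the t-test on change scores when the follow-up values are correlated with baseline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n = 60  # patients per arm (hypothetical RCT)

baseline = rng.normal(6.0, 1.5, 2 * n)                  # baseline pocket depth (mm)
group = np.repeat([0, 1], n)                            # 0 = control, 1 = test
followup = 2.0 + 0.6 * baseline - 0.5 * group + rng.normal(0, 1.0, 2 * n)
df = pd.DataFrame({"baseline": baseline, "group": group,
                   "followup": followup, "change": baseline - followup})

# t-test on change scores
t, p = ttest_ind(df.loc[df.group == 1, "change"], df.loc[df.group == 0, "change"])
print(f"t-test on change scores: t = {t:.2f}, P = {p:.3f}")

# ANCOVA: follow-up regressed on group, adjusting for baseline
ancova = smf.ols("followup ~ group + baseline", data=df).fit()
print(ancova.summary().tables[1])  # group coefficient near -0.5, smaller SE than the t-test
```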

Adjustment for multiple testing


Whether or not to adjust for multiple testing, and how, are controversial issues that have been debated in the clinical and statistical literature for many years (34, 57, 59–61, 67). The need to adjust for multiple testing arises when researchers try to compare differences among multiple groups or to test many differences within one group. Suppose a type-I error rate of 0.05 is used for statistical significance in a study comparing two treatments. If this study were repeated 100 times, randomly selecting patients from the population each time, we would expect to find a statistically significant difference about five times out of 100, even though the two treatments have the same effect. Of course, most studies are carried out only once, and this is why, when the P-value is small, we tend to believe that the difference is genuine. However, if multiple testing procedures are undertaken, either because there are more than two groups or because more than one outcome is tested in the study, the P-value can no longer be trusted and interpreted as it stands. Multiple comparison procedures aim to tackle this problem by constructing wider confidence intervals (i.e. using thresholds much smaller than 0.05 for the type-I error). Most statistical software packages provide several different multiple comparison procedures, and many clinicians (and probably statisticians) are unsure which is most appropriate. The Bonferroni correction is probably the most popular in clinical research, but also the most heavily criticized (57, 59, 60).

There is a trade-off in the use of a procedure, such as the Bonferroni t-test, to adjust for multiple testing. Suppose that there are five outcome variables to be tested in a study investigating differences in the treatment effects between two periodontal-regenerative techniques. The Bonferroni t-test uses P = 0.05/5 = 0.01 as the significance level instead of the usual level of 0.05. Whilst setting a smaller significance level reduces the risk of false-positive errors (i.e. one or more of the five comparisons appearing to be significant whilst there is in fact no difference in the treatment effects between the two techniques), it also increases the risk of false-negative errors. Each procedure has its proponents and opponents, and more complex adjustment procedures have been developed to improve upon existing ones (60, 67). As explained in the previous sections, P-values should not be relied upon too heavily for the interpretation of research findings. There are many factors to be considered, of which adjustment for multiple testing is just one.
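For the five-outcome example, the adjustment is a one-liner with statsmodels' multipletests routine; the raw P-values below are invented purely for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw P-values for five outcomes compared between two techniques.
raw_p = [0.012, 0.030, 0.200, 0.048, 0.650]

reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for p, pa, r in zip(raw_p, adj_p, reject):
    print(f"raw P = {p:.3f}  Bonferroni-adjusted P = {pa:.3f}  significant: {r}")
```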

Analytical issues
In the following sections, we try to explain several important concepts that are not specific to a particular test. They are concerned more with statistical thinking than with statistical testing.

Regression to the mean


Regression to the mean is a very common statistical phenomenon, which is perhaps widely known but less well understood (12, 27, 28, 65, 74, 78). It was first discovered by Francis Galton more than a century ago when he was studying the heredity of human traits (28). He found that tall parents tended to have tall children, but on average these children were shorter than their parents. On the other hand, short parents tended to have short children, but on average these children were taller than their parents. This does not mean that children of tall parents are always shorter than their parents or that children of short parents are always taller than their parents. Regression to the mean is about the average trend and inherent variation. When we randomly take a sample from a population, those with extremely high or low values tend to become lower or higher, respectively, on subsequent occasions. For example, if we measure full-mouth pocket depth in a periodontal patient twice, half an hour apart, the sites with the deepest pocket depth tend to become shallower and those with shallow pockets tend to become deeper. The positioning, angle and force of probing are unlikely to be consistent on both occasions, and rounding to the nearest millimeter may also contribute to differences between the two readings of probing depth for the same site. Observed probing depth is a combination of true (unobserved) pocket depth and measurement error. Sites with the deepest probing depths on the first occasion may have advanced periodontal disease and, by chance, larger measurement error (e.g. the positioning of the probe is not straight, too much force is used, etc.), but those errors do not necessarily recur on the second occasion. Regression to the mean can also occur when there is an intervention between two occasions, and this is one of the reasons why a placebo control group is required to estimate real treatment effects.

However, regression to the mean has caused some confusion for clinicians (and also statisticians) regarding how to analyze the relationship between treatment effects and baseline disease levels (39, 55, 76, 78, 79). For instance, many periodontal studies have found a strong association between pocket reduction (or attachment level gain) and baseline pocket depth (or attachment level). One common approach is to use correlation or regression to test this relationship. Suppose x1 is baseline pocket depth and x2 is the follow-up pocket depth after an intervention. Many studies test the relationship between (x1 − x2) and x1 and find a strong, positive association. Another approach is to divide the patients according to their baseline disease status and then test whether the treatment outcomes differ (69). For example, teeth with infrabony lesions are divided into two groups according to their baseline pocket depth (e.g. deep vs. shallow), using the mean baseline pocket depth as the criterion. Researchers then compare the changes in pocket depth between the two groups after regenerative periodontal surgery has been undertaken. Both approaches suffer from the problem of regression to the mean and can give rise to misleading results. As detailed discussions of the problems with these two approaches can be found in our previous publications, we provide only a simple example to illustrate the problem. In a study on the treatment of infrabony lesions, 203 defects were treated with guided tissue regeneration (25). The average pocket depths at baseline and at 1 year of follow-up were 9 mm and 3.4 mm, respectively. Figure 2A shows that when teeth are grouped according to baseline pocket depth, there seems to be a converging trend in the changes in pocket depth, suggesting that teeth with a deeper baseline pocket depth achieve greater pocket reduction after treatment. In contrast, Fig. 2B shows that when teeth are grouped according to follow-up pocket depth, there seems to be a diverging trend in the changes in pocket depth, suggesting that teeth with a deeper baseline pocket depth achieve less pocket reduction after treatment. This example illustrates that the categorization approach is problematic, as contrasting results can be obtained from analysis of the same data. More appropriate analyses of the relationship between treatment effects and baseline disease status can be found in our previous publications and elsewhere (55, 74).

Fig. 2. Trends of the changes in probing pocket depth in 203 infrabony defects treated with guided tissue regeneration, categorizing the defects according to their baseline probing pocket depth (A) or follow-up probing pocket depth (B).
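A small simulation (all values hypothetical; numpy only) makes the mechanism explicit: with a stable true pocket depth plus independent measurement error on each occasion, the sites that look deepest at the first examination are, on average, shallower at the second, even though nothing has changed.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sites = 5000

true_depth = rng.normal(4.0, 1.5, n_sites)            # stable, unobserved pocket depth (mm)
first = true_depth + rng.normal(0, 0.8, n_sites)      # first probing (measurement error added)
second = true_depth + rng.normal(0, 0.8, n_sites)     # second probing, new error, no treatment

deepest = first >= np.quantile(first, 0.90)           # top 10% of sites at the first examination
print(f"deepest sites, 1st reading: {first[deepest].mean():.2f} mm")
print(f"same sites, 2nd reading:    {second[deepest].mean():.2f} mm  (regressed toward the mean)")
```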

Mathematical coupling
The problem of mathematical coupling of data occurs when one variable has a mathematical relationship with other variables in statistical models (7, 8). Testing the relationship between treatment effects and baseline disease status, as discussed in the previous section on regression to the mean, is one example; in that scenario, the two problems are almost synonymous (74). Another common example in periodontal research is the relationship among pocket depth, clinical attachment level and gingival recession (pocket depth + gingival recession = clinical attachment level). There are two main problems with the analysis of mathematically coupled variables. First, when testing the association between such variables, the usual null hypothesis that the correlation coefficient or regression slope is zero is no longer appropriate (1, 69). For example, suppose x1 and x2 are two sets of random numbers with the same variance; whilst their correlation is expected to be zero, the correlation between (x1 − x2) and x1 is close to 0.71 (1, 69), indicating that using zero correlation for statistical significance testing is no longer valid and that the associated P-value is misleading. Alternative methods are therefore required for proper analyses. The second problem is perfect collinearity amongst mathematically coupled variables (71). For example, suppose we would like to test the relationships between changes in radiographic bone level and baseline pocket depth, attachment level and gingival recession. As the latter three variables are mathematically related, they cannot all be entered into an ordinary least-squares regression model simultaneously, and at least one needs to be removed from the model. This is because the three variables have only two dimensions, and the mathematical computations cannot proceed without dropping at least one of them. It is unusual for all of the mathematically coupled variables to be considered essential in explaining and predicting clinical outcomes, and the simple solution is to drop one or more of them based on their relative importance from a biological and clinical perspective. If all of the coupled variables are deemed essential in the statistical modeling process, more advanced methods, such as principal components regression and partial least-squares regression, may be useful (17, 73). We discuss these methods in the next section on multicollinearity.
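The 0.71 figure is easy to verify by simulation; this sketch (numpy only) draws two independent sets of random numbers with equal variance and correlates (x1 − x2) with x1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x1 = rng.normal(0, 1, n)          # e.g. baseline values
x2 = rng.normal(0, 1, n)          # independent "follow-up" values, same variance

print(f"corr(x1, x2)      = {np.corrcoef(x1, x2)[0, 1]: .3f}")       # ~0
print(f"corr(x1 - x2, x1) = {np.corrcoef(x1 - x2, x1)[0, 1]: .3f}")  # ~0.71 = 1/sqrt(2)
```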

Multicollinearity
Collinearity, or multicollinearity, originally meant that one variable is a weighted combination of other variables (i.e. mathematically coupled), but it now usually means that explanatory variables in statistical models are highly correlated (17, 77). For example, radiographic bone level is expected to be highly correlated with clinical attachment level and with infrabony defect depth measured during surgery. If all three variables are entered into a regression model as explanatory variables, it becomes difficult to distinguish their individual relationships with the outcome. This yields greater uncertainty in the estimation of both the regression coefficients and their confidence intervals, and the results are less likely to be replicated by other studies. A simple solution to this problem is to reduce the number of collinear variables; for example, only radiographic bone level and the surgical measurement of infrabony defect depth are used as explanatory variables. However, it may not be easy to make decisions on the selection of variables, and sometimes we may need all of them in the same model. More complex solutions to this problem are to use alternative regression modeling strategies, such as ridge regression, principal components regression and partial least-squares regression (1, 14, 56, 57, 80, 84). These methods are generally known as shrinkage regression because they provide slightly biased estimates of regression coefficients but with greater precision. In other words, the regression coefficients from these methods differ from the true values in the population, but their confidence intervals are much smaller than those from ordinary least-squares regression. The rationale of ridge regression is to add an increasing degree of noise to the data to reduce the collinearity until the regression coefficients become stable (17). Principal components regression and partial least-squares regression use a different strategy: they create latent constructs (known as components) that are weighted combinations of the original collinear variables, and these components are then used as the new explanatory variables (1, 14, 57, 80, 84). The number of components that can be extracted is equal to the number of variables (when there is no perfect collinearity). When only a few components are used, regression coefficients from these methods are likely to be biased (because only part of the available information is used), but these coefficients have much smaller confidence intervals (i.e. greater precision). The difference between principal components regression and partial least-squares regression lies in how the components are extracted, and several studies show that partial least-squares regression performs better than principal components regression. Partial least-squares regression can be a very useful tool for periodontal research, as many clinical and radiographic measurements are either highly or even perfectly collinear. A demonstration of applying partial least-squares regression to periodontal research can be found in our previous study (73).
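As an illustration of the component-based idea only (not the analysis from reference 73), the sketch below simulates three strongly collinear clinical predictors driven by one underlying severity score and fits a two-component partial least-squares regression alongside ordinary least squares; scikit-learn is assumed.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200

# Three collinear "clinical" predictors, all proxies for one underlying severity score.
severity = rng.normal(0, 1, n)
X = np.column_stack([
    severity + rng.normal(0, 0.1, n),   # e.g. clinical attachment level
    severity + rng.normal(0, 0.1, n),   # e.g. radiographic bone level
    severity + rng.normal(0, 0.1, n),   # e.g. surgical defect depth
])
y = 1.5 * severity + rng.normal(0, 1, n)  # outcome, e.g. radiographic defect fill

ols = LinearRegression().fit(X, y)
pls = PLSRegression(n_components=2).fit(X, y)

print("OLS R^2:", round(ols.score(X, y), 3))
print("PLS R^2:", round(pls.score(X, y), 3))   # similar fit obtained from a few stable components
```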


Assessment of agreement
The assessment of agreement is often of paramount importance in dental research. For instance, one might consider the scenario where we wish to assess the agreement between measurements of probing pocket depth. This scenario could also take the form of a single observer wishing to check the repeatability of their observations: for example, how repeatable are measurements of probing pocket depth? It is often the case that research involves more than one observer making the observations. In all scenarios, researchers wish to derive some estimate of the extent, or measure, of agreement. For categorical outcomes, each observer in a pair of observers makes an observation of a binary outcome. The ubiquitous statistic of choice in assessing agreement is Cohen's kappa (18), which is defined as the chance-corrected proportional agreement and is given by:

κ = (Observed − Expected) / (1 − Expected)

where Observed is the proportion of observed agreement and Expected is the proportion of expected agreement. Kappa takes values in the range of −1 to +1, where a value of −1 indicates total disagreement and +1 indicates total agreement. Thus, κ = 0 indicates that the level of agreement is no better than would be expected by chance. Landis & Koch (45) gave benchmarks for interpreting kappa (Table 1). It is common practice to compare the estimated value of kappa against these guidelines. These guidelines were never intended to be applied strictly or blindly, and their use has been criticized (48, 49).

Table 1. Landis & Koch benchmarks

Value of κ      Strength of agreement
≤ 0.20          Poor
0.21–0.40       Fair
0.41–0.60       Moderate
0.61–0.80       Good
0.81–1.00       Very good

Since Cohen proposed the kappa statistic, many extensions have been developed that go beyond the 2 × 2 case to include situations with many observers and/or outcomes with many categories (both nominal and ordinal) (9, 19, 21, 37, 50, 66). For the purpose of this paper we remain focused on the 2 × 2 case, as the key points made here about kappa extend to all other variants of kappa. There are many problems with kappa that are repeatedly overlooked, possibly because the alternative statistical strategies are deemed too complicated by nonstatisticians. The first problem is that kappa is dependent on the number of categories: the fewer the categories (i.e. the more a table is collapsed), the higher the value of kappa. This is perhaps obvious, because the fewer categories one has to choose from, the more likely one is to agree, as there is less scope to disagree. A second issue relates to the effect that the outcome prevalence has on the estimate of kappa, and the paradoxes it thereby causes, as the value of kappa depends heavily on the prevalence of the observation being measured (15, 26, 47). Consider the situation where the proportion of agreement is 80% (Table 2A, Observed). Superficially, this looks quite good. The corresponding value of kappa, however, is −0.11 (because the observed proportional agreement is 0.80 whilst the expected proportional agreement is 0.82); in other words, the agreement is worse than expected by chance alone. If we maintain the proportion of agreement at 80% but change the prevalence (Table 2B, Observed), kappa is 0.60. The differences in the expected proportions under these conditions can be seen in Table 2A, Expected and Table 2B, Expected. Consequently, a table of frequencies should always be considered alongside kappa.

Although kappa is generally considered to lie between −1 and +1 (where a value of −1 is thought to indicate perfect disagreement), this minimum is theoretically possible only when the total disagreement between observers is balanced. In all other cases, the value of kappa is greater than −1, and when disagreement is unbalanced in the extreme (the two observers always disagree), perfect disagreement yields a kappa of zero. Although we usually would not expect to find perfect disagreement, this further illustrates the difficulty in interpreting kappa and its dependence on prevalence.

Table 2. Observed and expected frequencies (out of 100 observations) for two scenarios with the same observed agreement of 80% but different prevalence

A. Observed                    A. Expected
        Yes   No   Total               Yes   No   Total
Yes      80   10      90       Yes      81    9      90
No       10    0      10       No        9    1      10
Total    90   10     100       Total    90   10     100

B. Observed                    B. Expected
        Yes   No   Total               Yes   No   Total
Yes      40   10      50       Yes      25   25      50
No       10   40      50       No       25   25      50
Total    50   50     100       Total    50   50     100
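The two kappa values quoted for Table 2 can be reproduced directly from the 2 × 2 counts; a minimal sketch (plain Python, counts taken from the table above) follows.

```python
def cohen_kappa(table):
    """Cohen's kappa from a 2x2 table of counts [[yes_yes, yes_no], [no_yes, no_no]]."""
    n = sum(sum(row) for row in table)
    observed = (table[0][0] + table[1][1]) / n
    row = [sum(r) / n for r in table]
    col = [(table[0][j] + table[1][j]) / n for j in range(2)]
    expected = row[0] * col[0] + row[1] * col[1]
    return (observed - expected) / (1 - expected)

table_2a = [[80, 10], [10, 0]]   # 80% agreement, prevalence ~90% -> kappa ~ -0.11
table_2b = [[40, 10], [10, 40]]  # 80% agreement, prevalence 50%  -> kappa = 0.60

print(round(cohen_kappa(table_2a), 2))  # -0.11
print(round(cohen_kappa(table_2b), 2))  # 0.6
```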

Now consider whether or not observations are actually made blind. For illustration, suppose that nine of 10 people with periodontal diseases have chronic periodontitis and only one in 10 has aggressive periodontitis. Clinicians are likely to diagnose chronic periodontitis in nine out of 10 cases, before even seeing the patients. The expected probability that two observers agree by chance would then be the probability that they both diagnose chronic periodontitis (0.9 × 0.9) plus the probability that they both diagnose aggressive periodontitis (0.1 × 0.1), which is 0.82. Under the naïve assumptions surrounding kappa, the expected probability that two clinicians would agree by chance would be 0.5 (0.5 × 0.5 + 0.5 × 0.5). This calls into question whether the expected agreement used by kappa to adjust for chance is correct. The further the prevalence lies from 50%, the more important this consideration becomes. If agreement is assessed on a subsample with a different prevalence from that of the sample to be studied, the level of agreement for the sample cannot be inferred. This is important when using kappa to undertake calibration, where a range of observations is chosen to include all possible points on a scale (the rationale being that it is good to gain experience in using the full range). The prevalence in the calibration exercise is not equal to that of the sample to be observed; the inference of successful calibration based on kappa may thus be erroneous for the sample to be studied.

Furthermore, kappa is typically evaluated against the wrong null hypothesis. The standard null value against which kappa is usually tested is zero, yet this is erroneous because one would expect trained clinicians to show considerable agreement and therefore to achieve agreement much better than chance. If kappa were to be tested, it should be for equivalence to one; as kappa is bounded at 1, this is strictly the same as noninferiority to 1. One method for doing this is to use the confidence interval approach. It remains imperative to consider observer agreement, and especially to consider the impact of observer variation in studies. This should not be achieved by selecting a subset of the study sample (thereby leading to low statistical power), calculating kappa to illustrate that observers do not differ statistically significantly in their observations (against the wrong null hypothesis) and thus concluding that differences amongst observers can be ignored. Researchers should treat the (potential) effect of different observers not as a negligible nuisance to be ignored, but as an inevitable consequence of the study design, and should embrace observer variability effects in any subsequent statistical analysis. This can readily be done for any number of observers by adopting more sophisticated statistical techniques, such as multilevel modeling (29–35, 38, 43, 81) or generalized estimating equations (46), each of which also addresses the pervasive problem within dental research of clustered outcome data (see next section). Such analytical strategies need to be adopted at the planning phase, however, as the use of these methods tends to substantially affect the study design.

Subject-based vs. site-based analyses

One unique feature of periodontal data is that some variables are measured at the site level or the tooth level, whilst others are measured at the subject level. From a clinical viewpoint, sites and teeth within the same patient may display quite different disease profiles. For example, even in patients with advanced periodontal disease, whilst many teeth show extensive periodontal pocketing and bone loss, a few teeth seem to be unaffected. Therefore, site-based or tooth-based analysis might seem preferable to subject-based analysis, as the latter takes the average values across sites and teeth, and much information is lost in the averaging. However, it is not appropriate to treat sites and teeth in the same patient as independent units, as they share the same environment, such as the general immune response and systemic health. Before the development of sophisticated statistical methodologies such as multilevel modeling (31, 35, 40, 73) and generalized estimating equations (46, 81), many periodontal studies undertook site-level analysis without properly taking into account the clustering of sites and teeth within patients. This can give rise to inflated sample sizes and overly optimistic confidence intervals. Many periodontal researchers have become aware of this problem, and multilevel modeling and generalized estimating equations have now been widely adopted.

Aggregated vs. multilevel analysis

Another advantage of multilevel analysis is that it can properly specify the clinical structure of periodontal data and estimate the relationships between the outcome and explanatory variables at different levels. For example, furcation involvement is a site-level variable, tooth mobility a tooth-level variable and cigarette smoking a subject-level variable. They affect the progression and treatment outcome of periodontal disease at different levels. If the site-level and tooth-level measurements are aggregated to the subject level, it becomes difficult to assess the impact of furcation involvement and tooth mobility, which are local factors. Multilevel analysis provides greater flexibility in statistical modeling and is also a framework for testing complex interactions between variables operating at different levels.

Repeated measurements

To ascertain whether interventions have genuine effects, repeated observations are required to establish the causal effects. It is common in epidemiological research to have multiple measurements of outcomes over a period of time. Analysis of repeated measurements can be a challenging task when the number of measurements is large and the trend in changes is not linear. Most periodontal trials measure treatment outcomes at baseline and at follow-up after the intervention. Some have more than one follow-up, and usually changes in outcomes from baseline are calculated for the different follow-up time-points. For instance, when there are two follow-ups, two change scores are calculated and the differences in these two change scores between the treatment and control groups are tested. The statistical analysis in this approach is relatively straightforward, and if the differences in outcomes during the interim period are also of research interest, this approach may be justified. However, this approach has the problem of multiple testing, as discussed previously, because comparisons of the groups are undertaken at multiple time-points independently. Moreover, if the research interest is the group difference in the general trend of outcome changes, the multiple-change-scores approach is less useful. Another problem is that whilst some covariates remain constant over the observation period (e.g. gender), some may change (e.g. age). For instance, suppose a study aims to investigate the associations between the numbers of certain periodontal pathogens and the progression of periodontal disease at the site level for patients with different genetic polymorphisms. As the number of periodontal pathogens changes over time for each site, sophisticated statistical models are required to analyze the dynamic interactions among genetics, environment (i.e. bacteria) and periodontal disease.

Many advanced statistical tools are available for sophisticated analysis of repeated measurements. As repeated measurements can be viewed as clustered data (multiple measurements within subjects), multilevel modeling and generalized estimating equations are frequently used for this purpose. Two specific issues need to be considered in multilevel modeling and generalized estimating equations. The first is the specification of the trend. In many periodontal trials of treatment effects, most changes in outcomes take place in the first few weeks or months, and if the trend is not correctly specified, this may yield misleading conclusions when estimating the differences in treatment effects between groups. The second issue is the autocorrelation between repeated measurements. Autocorrelation means the correlation amongst successive measurements: measurements made close together in time are generally more highly correlated than those made far apart in time. Incorrect autocorrelation structures may result in underestimation or overestimation of treatment effects.

Using generalized estimating equations requires explicit specification of the trend in the outcome (i.e. the pattern of changes over time) and of the autocorrelation amongst repeated measurements. In multilevel modeling, the specification of the trend in the outcome constitutes the fixed effects, and the modeling of the autocorrelation is handled through the random effects. A difference between the two approaches is that generalized estimating equations only estimate the average trend for the population, whilst multilevel modeling, under certain assumptions, estimates the trend for each individual and also obtains the average trend for the population. The autocorrelation is explicitly specified in generalized estimating equations by setting up correlation structures amongst the repeated measurements, and many covariance structures have been proposed in the literature. Suppose there are four repeated measurements, so that the covariance structure of the residuals contains four variances and six covariances. This is usually written as a 4 × 4 matrix:

V = | e1    cv12  cv13  cv14 |
    | cv12  e2    cv23  cv24 |
    | cv13  cv23  e3    cv34 |
    | cv14  cv24  cv34  e4   |

where e1 to e4 are the residual variances on the first to fourth occasions, respectively, and cvij (i, j = 1 to 4, but i ≠ j) is the covariance between the ith and jth occasions (e.g. cv12 is the covariance between occasions 1 and 2). It is well known that the correlation coefficient between, for example, occasions 1 and 2 (ρ12) is given by ρ12 = cv12 / √(e1 × e2). Following the above notation, the correlation matrix R can be written as:

R_UN = | 1     ρ12   ρ13   ρ14 |
       | ρ12   1     ρ23   ρ24 |
       | ρ13   ρ23   1     ρ34 |
       | ρ14   ρ24   ρ34   1   |

where ρij (i and j = 1 to 4, but i ≠ j) is the correlation coefficient. There are six correlations, ρ12 to ρ34, in R_UN that need to be estimated. Naturally, R_UN will be very close to the observed correlations in the data, but it also requires more degrees of freedom to be modeled. R_UN is the most complex correlation structure and is known as the unstructured autocorrelation. A simpler structure, known as the first-order autoregressive covariance structure, is written as:

R_AR(1) = | 1    ρ    ρ²   ρ³ |
          | ρ    1    ρ    ρ² |
          | ρ²   ρ    1    ρ  |
          | ρ³   ρ²   ρ    1  |

where the correlations decrease with increasing time separation by powers of ρ (0 ≤ ρ ≤ 1). Therefore, there is only one parameter in R_AR(1) that needs to be estimated: ρ. Another simple covariance structure is the exchangeable, or compound symmetry, covariance structure:

R_CS = | 1   ρ   ρ   ρ |
       | ρ   1   ρ   ρ |
       | ρ   ρ   1   ρ |
       | ρ   ρ   ρ   1 |

where the correlation between any two residuals is assumed to be identical. Therefore, there is again only one parameter in R_CS that needs to be estimated: ρ. Which autocorrelation structure should be used is determined empirically (i.e. different structures suit different data sets).

In multilevel modeling, the estimation of the random effects is related to how the individual trend is modeled. Variations in the trend, such as the variances of the model intercept and slope and their covariance, determine the autocorrelation structure. However, when necessary, an explicit specification of the autocorrelation structure can be added to the multilevel model to improve model fit. One approach commonly used in the social sciences for analyzing repeated measurements is latent growth-curve modeling (13, 24, 72). Latent growth-curve modeling is equivalent to multilevel modeling in many respects, but software packages for latent growth-curve modeling provide additional flexibility in modeling trends. One important advantage of latent growth-curve modeling over multilevel modeling is that it is straightforward to incorporate latent variables into the model-building process.
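As a concrete, deliberately simplified illustration of clustered periodontal data, the sketch below simulates pocket depths for sites nested within patients and fits a random-intercept multilevel model with statsmodels; the variable names and effect sizes are invented, and a generalized estimating equation with an exchangeable working correlation could be fitted to the same data frame in a similar way.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_patients, n_sites = 40, 30

# Simulated data: sites clustered within patients, plus a patient-level exposure (smoking).
patient = np.repeat(np.arange(n_patients), n_sites)
smoker = np.repeat(rng.integers(0, 2, n_patients), n_sites)           # subject-level covariate
patient_effect = np.repeat(rng.normal(0, 0.8, n_patients), n_sites)   # shared within patient
pocket_depth = 3.0 + 0.7 * smoker + patient_effect + rng.normal(0, 1.0, n_patients * n_sites)

df = pd.DataFrame({"pd": pocket_depth, "smoker": smoker, "patient": patient})

# Random-intercept multilevel model: sites (level 1) nested within patients (level 2).
model = smf.mixedlm("pd ~ smoker", data=df, groups=df["patient"]).fit()
print(model.summary())   # fixed effect for smoking plus the between-patient variance
```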

Concluding remarks and future developments


Many clinical researchers regard statistics as a collection of mathematical tools that generate P-values for publications. This is very unfortunate, because statistical analysis is not just a testing process but also a thinking process. With the help of computers, most statistical analyses in periodontal research can be carried out within minutes or seconds, and this should give researchers more time for statistical thinking. Whilst statistics is mathematical in essence, an intuitive conceptual understanding can still be acquired, and it is crucial for clinical researchers to have a basic understanding of the statistical concepts relevant to their research. Certainly, we welcome greater collaboration between periodontal researchers and statisticians. Nevertheless, to bring out the best in this collaboration, periodontal researchers have to acquire basic statistical training, which is instrumental in generating interesting research hypotheses and proper study designs. Conversely, in our opinion, statisticians also need to obtain a working knowledge of periodontology, and this will only happen if the collaboration between both parties is regular and well engaged.

Advanced statistical methods have been applied with increasing frequency in periodontal research, and new methods have been developed to analyze complex periodontal data from new perspectives. This creates a huge challenge, but also a great opportunity, for the periodontal research community, which comprises researchers, consumers (e.g. practicing periodontists and patients) and journal editors. For instance, there is a large literature on the associations between periodontal diseases and systemic health, and most of it consists of observational studies that require careful adjustment for confounding and cautious interpretation of causal relationships (23, 42, 75). We are aware of many instances where researchers have been too hasty to embrace a causal interpretation of associations without carefully considering the limitations of their study design and data collection. These caveats could have been rectified in the peer-review process if suitable reviewers (e.g. periodontists with epidemiological training) had been available.

Periodontal research has taken advantage of many new developments in other disciplines, such as microbiology and molecular biology, over the last few decades. Many periodontal researchers are also microbiologists or immunologists, but few are also statisticians. The barrier is likely to be the language used in statistics: mathematics, which is highly abstract and often seems impenetrable. In our judgment, the periodontal community needs a collective effort to break this barrier in order to improve research quality and embrace new opportunities. Many medical journals publish educational papers on statistical issues for general medical readers, but this is rare in dental journals. One exception is a series of excellent review papers on Further Statistics in Dentistry, published in the British Dental Journal in 2002 (52). Although courses on basic statistics are taught in dental schools for undergraduate and postgraduate students, training courses on statistical analysis specifically designed for dental researchers are very rare. Collaboration of periodontal researchers with statisticians should become the norm rather than the exception or a luxury. This can only occur if more resources are allocated to support statisticians working with dental researchers and if more statisticians take a proactive interest in dental research.

References
1. Abdi H. Partial least squares regression and projection on latent structure regression (PLS Regression). Wiley Interdiscip Rev Comput Stat 2010: 2: 97106. 2. Altman DG. Statistics in medical journals. Stat Med 1982: 1: 5971. 3. Altman DG. Statistics in medical journals developments in the 1980s. Stat Med 1991: 10: 18971913. 4. Altman DG. Statistical reviewing for medical journals. Stat Med 1998: 17: 26612674. 5. Altman DG, Bland JM. Improving doctors understanding of statistics (with discussion). J R Stat Soc Ser A Stat Soc 1991: 154: 223267. 6. Altman DG, Bland JM. Standard deviations and standard errors. BMJ 2005: 331: 903. 7. Andersen B. Methodological Errors in Medical Research. London: Blackwell, 1990. 8. Archie JP. Mathematical Coupling: a common source of error. Ann Surg 1981: 193: 296303. 9. Banerjee M. Beyond kappa: a review of interrater agreement measures. Can J Stat 1999: 27: 323. 10. Blance A, Tu Y-K, Baelum V, Gilthorpe MS. Statistical issues on the analysis of change in follow-up studies in dental research. Community Dent Oral Epidemiol 2007: 35: 412 420. 11. Bland M. An Introduction to Medical Statistics. Oxford: Oxford University Press, 2000. 12. Bland JM, Altman DG. Regression towards the mean. BMJ 1994: 308: 1499.

86

Statistical and analytical issues in periodontal research


13. Bollen KA, Curran PJ. Latent Curve Models. Hoboken, NJ: Wiley, 2006.
14. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007: 8: 32–44.
15. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993: 46: 423–429.
16. Campbell MJ, Swinscow TDV. Statistics at Square One, 11th edn. Chichester: John Wiley & Sons Ltd, 2006.
17. Chatterjee S, Hadi AS, Price B. Regression Analysis by Example, 3rd edn. New York: John Wiley & Sons, 2000.
18. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur 1960: 20: 37–46.
19. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968: 70: 213–220.
20. Cohen ME. Analysis of ordinal dental data: evaluation of conflicting recommendations. J Dent Res 2001: 80: 309–313.
21. Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics 1982: 38: 1047–1051.
22. Dawson B, Trapp RG. Basic & Clinical Biostatistics. New York: McGraw-Hill, 2001.
23. Dietrich T, Jimenez M, Krall Kaye EA, Vokonas PS, Garcia RI. Age-dependent associations between chronic periodontitis/edentulism and risk of coronary heart disease. Circulation 2008: 117: 1668–1674.
24. Duncan TE, Duncan SC, Strycker LA. An Introduction to Latent Variable Growth Curve Modeling, 2nd edn. Mahwah, NJ: Lawrence Erlbaum Associates, 2006.
25. Falk H, Laurell L, Ravald N, Teiwik A, Persson R. Guided tissue regeneration therapy of 203 consecutively treated intrabony defects using a bioabsorbable matrix barrier. Clinical and radiographical findings. J Periodontol 1997: 68: 571–581.
26. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990: 43: 543–549.
27. Friedman M. Do old fallacies ever die? J Econ Lit 1992: 30: 2129–2132.
28. Galton F. Regression toward mediocrity in hereditary stature. J R Anthropol Inst 1886: 15: 246–263.
29. Gilthorpe MS, Griffiths GS, Maddick IH, Zamzuri AT. The application of multilevel modelling to periodontal research data. Community Dent Health 2000: 17: 227–235.
30. Gilthorpe MS, Griffiths GS, Maddick IH, Zamzuri AT. An application of multilevel modelling to longitudinal periodontal research data. Community Dent Health 2001: 18: 79–86.
31. Gilthorpe MS, Maddick IH, Petrie A. Introduction to multilevel modelling in dental research. Community Dent Health 2000: 17: 222–226.
32. Gilthorpe MS, Zamzuri AT, Griffiths GS, Maddick IH, Eaton KA, Johnson NW. Unification of the burst and linear theories of periodontal disease progression: a multilevel manifestation of the same phenomenon. J Dent Res 2003: 82: 200–205.
33. Glantz S. Biostatistics: how to detect, correct, and prevent errors in the medical literature. Circulation 1980: 61: 1–7.
34. Glantz SA. Primer of Biostatistics, 5th edn. New York: McGraw-Hill, 2002.
35. Goldstein H. Multilevel Statistical Models, 2nd edn. New York: John Wiley, 1995.
36. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993: 137: 485.
37. Graham P, Jackson R. The analysis of ordinal agreement data: beyond weighted kappa. J Clin Epidemiol 1993: 46: 1055–1062.
38. Hart A. Mann–Whitney test is not just a test of medians: differences in spread can be important. BMJ 2001: 323: 391–393.
39. Hayes RJ. Methods for assessing whether change depends on initial value. Stat Med 1988: 7: 915–927.
40. Hox J. Multilevel Analysis. Mahwah, NJ: Lawrence Erlbaum Associates, 2002.
41. Hubbard R, Lindsay RM. Why P values are not a useful measure of evidence in statistical significance testing. Theory Psychol 2008: 18: 69–88.
42. Hujoel P. Dietary carbohydrates and dental-systemic diseases. J Dent Res 2009: 88: 490–502.
43. Gill J. The insignificance of null hypothesis significance testing. Polit Res Q 1999: 52: 647–674.
44. Kirkwood B, Sterne JAC. Essential Medical Statistics. Oxford: Blackwell, 2003.
45. Landis JR, Koch GG. Measurement of observer agreement for categorical data. Biometrics 1977: 33: 159–174.
46. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986: 73: 13–22.
47. Lord FM. A paradox in the interpretation of group comparisons. Psychol Bull 1967: 68: 304–305.
48. Ludbrook J. Comparing methods of measurement. Clin Exp Pharmacol Physiol 1997: 24: 193–203.
49. Ludbrook J. Statistical techniques for comparing measurers and methods of measurement: a critical review. Clin Exp Pharmacol Physiol 2002: 29: 527–536.
50. Mak TK. Analyzing intraclass correlation for dichotomous variables. J R Stat Soc Ser C Appl Stat 1988: 37: 344–352.
51. Mohr LB. Regression artifacts and other customs of dubious desert. Eval Program Plann 2000: 23: 397–409.
52. Moles D. Further statistics in dentistry: introduction. Br Dent J 2002: 193: 375.
53. Newton RR, Rudestam KE. Your Statistical Consultant. Thousand Oaks: Sage Publications, 1999.
54. Nickerson RS. Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods 2000: 5: 241–301.
55. Oldham PD. A note on the analysis of repeated measurements of the same subjects. J Chronic Dis 1962: 15: 969–977.
56. Phatak A, de Jong S. The geometry of partial least squares. J Chemom 1997: 11: 311–338.
57. Poole C. Multiple comparisons? No problem! Epidemiology 1991: 2: 241–243.
58. Rice JA. Mathematical Statistics and Data Analysis, 2nd edn. Belmont, CA: Duxbury Press, 1994.
59. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990: 1: 43–46.
60. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology, 3rd edn. Philadelphia: Lippincott Williams & Wilkins, 2008.
61. Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol 1995: 142: 904–908.
62. Senn SJ. Cross-Over Trials in Clinical Research, 2nd edn. Chichester: John Wiley & Sons, 2002.
63. Sterne JAC. Teaching hypothesis tests – time for significant change? Stat Med 2001: 21: 985–994.
64. Sterne JAC, Davey Smith G. Sifting the evidence – what's wrong with significance tests? BMJ 2001: 322: 226–231.
65. Stigler SM. Regression towards the mean, historically considered. Stat Methods Med Res 1997: 6: 103–114.
66. Tanner MA, Young MA. Modeling agreement among raters. J Am Stat Assoc 1985: 80: 175–180.
67. Thomas DC, Clayton DG. Betting odds and genetic associations. J Natl Cancer Inst 2004: 96: 421–423.
68. Thompson B. If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory Psychol 1999: 9: 165–181.
69. Tu YK, Baelum V, Gilthorpe MS. The relationship between baseline value and its change: problems in categorization and the proposal of a new method. Eur J Oral Sci 2005: 113: 279–288.
70. Tu YK, Blance A, Clerehugh V, Gilthorpe MS. Statistical power for analyses of change in randomized controlled trials. J Dent Res 2005: 84: 283–287.
71. Tu YK, Clerehugh V, Gilthorpe MS. Collinearity in linear regression is a serious problem in oral health research. Eur J Oral Sci 2004: 112: 389–397.
72. Tu YK, D'Aiuto F, Baelum V, Gilthorpe MS. An introduction to latent growth curve modelling for longitudinal continuous data in dental research. Eur J Oral Sci 2009: 117: 343–350.
73. Tu YK, Galobardes B, Smith GD, McCarron P, Jeffreys M, Gilthorpe MS. Associations between tooth loss and mortality patterns in the Glasgow Alumni Cohort. Heart 2007: 93: 1098–1103.
74. Tu YK, Gilthorpe MS. Revisiting the relation between change and initial value: a review and evaluation. Stat Med 2007: 26: 443–457.
75. Tu YK, Gilthorpe MS, D'Aiuto F, Woolston A, Clerehugh V. Partial least squares path modelling for relations between baseline factors and treatment outcomes in periodontal regeneration. J Clin Periodontol 2009: 36: 984–995.
76. Tu YK, Gilthorpe MS, Griffiths GS. Is reduction of pocket probing depth correlated with the baseline value or is it mathematical coupling? J Dent Res 2002: 81: 722–726.
77. Tu YK, Kellett M, Clerehugh V, Gilthorpe MS. Problems of correlations between explanatory variables in multiple regression analyses in the dental literature. Br Dent J 2005: 199: 457–461.
78. Tu YK, Law GR. Re-examining the associations between family backgrounds and children's cognitive developments in early ages. Early Child Dev Care 2010: 180: 1243–1252.
79. Tu YK, Maddick IH, Griffiths GS, Gilthorpe MS. Mathematical coupling can undermine the statistical assessment of clinical research: illustration from the treatment of guided tissue regeneration. J Dent 2004: 32: 133–142.
80. Tu YK, Woolston A, Baxter PD, Gilthorpe MS. Assessing the impact of body size in childhood and adolescence on blood pressure: an application of partial least squares regression. Epidemiology 2010: 21: 440–448.
81. Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology. Cambridge: Cambridge University Press, 2003.
82. Vickers AJ. The use of percentage changes from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 2001: 1: 6.
83. Wainer H, Brown LM. Two statistical paradoxes in the interpretation of group differences: illustrated with medical school admission and licensing data. Am Stat 2004: 58: 117–123.
84. Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intelligent Lab Syst 2001: 58: 109–130.
