RESEARCH
A A Qureshi, T Ibrahim
Abstract
An understanding of the basic principles of statistical analysis is vital before commencing research. This article aims to provide a concise overview of this extensive subject, highlighting the important concepts. Statistical analysis should be considered at the planning stage of any study so as to establish hypotheses, specify the primary outcome of interest and undertake a sample size (power) calculation. The research question, scale of measurement and distribution of the outcome variable all have a bearing on the appropriate choice of statistical test. A statistical test can only be employed if the distribution assumptions of the test have been met. The interpretation of significance must be tempered by the limitations of the method of analysis, as well as by recognizing the variability of the effect of interest using an interval estimate. The various descriptive statistics in diagnostic studies are also explored.
Keywords conﬁdence intervals; p values; power calculation; statistics
Introduction: the importance of statistics in clinical practice
A fundamental understanding of statistical analysis is a necessary prerequisite to undertaking clinical research. Despite this, many otherwise well designed studies are let down by poor analysis and incorrect application of tests, an unfortunate consequence of insufficient knowledge or attention being devoted to this vital part of the research. To a certain extent this reflects deficiencies amongst clinicians in understanding and implementing statistical methods. Although importance is attached to study design, critical analysis of research based on the appropriate use of statistics is often suboptimal. All too frequently an assigned p value assumes overwhelming importance in the results of a study and has been demonstrated to be a source of publication bias. ^{1} This may have significant healthcare implications if ineffective, costly treatments are adopted whilst beneficial interventions are marginalized despite evidence that is not capable of standing up to scientific scrutiny. The demands generated by the ever increasing development of medical technologies cannot be met by finite healthcare resources. The appropriate utilization of resources alongside improved patient care are the anticipated sequelae of evidence based practice. Statistical science will always remain an essential step in the progression of this paradigm. The purpose of this article is to highlight the basic principles of statistical analysis in orthopaedic research without reference to complex mathematical theorems or scientific proofs. The appropriate use of statistics is intimately related to the major considerations in study design and ultimately is driven by the research question. Thus, it is crucial that statistical analysis is considered at an early stage in the inception of a study, as this can help to avoid several potential pitfalls later on. Although preliminary discussion with a statistician is beneficial to a study, it is no substitute for a basic grounding in statistical methods amongst the trial developers. Important concepts and considerations relating to the design of studies have been covered in our previous article. ^{2} The intention of this article is to deliver a concise overview of how statistics can be appropriately utilized to generate robust findings from a study. This understanding is twinned with the acquisition of skills enabling critical analysis of the interpretation and presentation of results in scientific papers, ultimately endowing the reader with the insight to question the legitimacy of the conclusions drawn from any study they read.

A A Qureshi MB BS MSc MRCS Specialist trainee in Orthopaedic Surgery, University Hospitals of Leicester, Leicester Royal Infirmary, Leicester LE1 5WW, UK.

T Ibrahim MB BS(Hons.) MD FRCS(Tr & Orth) Clinical Lecturer in Orthopaedic Surgery, University of Leicester, Leicester General Hospital, Leicester LE5 4PW, UK.
Why are statistics necessary?
Scientific reasoning has traditionally involved considering entities as discrete and absolute, where measurements are unwavering despite endless repetition or altered circumstances. However, even on the smallest conceivable scale of observation this perspective has shifted. The birth of quantum mechanics around the turn of the 20th century arose from the realization that phenomena involving electrons could not be explained in terms of classical mechanics. ^{3} Heisenberg's uncertainty principle proposed that an electron's spatial location was best understood as existing within a cloud of probability where a precise location was "uncertain." This uncertainty arises from the understanding that there are countless factors exerting varying magnitudes of influence on the behaviour of this smallest of particles. As the complexity of the phenomenon of interest increases, the number of governing factors and the accompanying uncertainty must doubtlessly increase. Thus it can be seen that when biological systems are subjected to scientific observation, the one intended true measure of a variable is rarely observed, owing to variation in the phenomenon of interest as a result of the complex interplay of competing influences. In essence, this is why we call these measured properties variables: because they vary, and this variation is often described as random. Usually we are investigating the effect of altering a variable, known as the independent variable, on the behaviour of an outcome or dependent variable. With the understanding that all variables are subject to variation, the extent of which is related to the number of determining factors, we can elevate our thinking to consider the following points:
Can we quantify the observed variation for a particular dependent variable? e.g. what is the 10 year survivorship of a specific total hip replacement?
Can we determine which independent variables are important in determining the extent of this variation?
2010 Elsevier Ltd. All rights reserved.
e.g. age of patient, BMI, postoperative infection, length of inpatient stay: which of these variables affect survivorship?
Can we determine the direction of effect when we change one of the independent variables? e.g. is an increased body mass index associated with enhanced survivorship or early failure?
Can we determine the magnitude of effect when we change one of the independent variables? e.g. what is the difference in survivorship, in number of years, between a patient aged 60 years and a patient aged 70 years?
Can we formulate a predictive model to explain the variation of the dependent variable? e.g. by knowing the patient age, BMI and the nature of any postoperative complications, can we predict the length of survivorship of the implant?
Statistical science attempts to address such questions. The most comprehensive definition of statistics is given by Kirkwood ^{4} as "the science of collecting, summarizing, presenting, and interpreting data, and the using of them to test hypotheses." It is logical to surmise that the quality of statistical analysis is directly related to the quality of the data and, by extension, of the study itself. Effective study design is the crucial foundation of research: well formulated statistical analysis can be rendered redundant by a poorly executed study, whereas a well designed study marred by poor statistical analysis can still be redeemed by repeat analysis. The research question itself drives the study design and in turn the analysis. Deficiencies in the planning stage are one of the main reasons for poor studies. These elements may be considered as sequential mechanisms within the research engine (Figure 1). The engine will not start if one of the early gears has failed, even if the subsequent ones have been furnished to perfection. Conversely, poor statistical analysis in a well designed study will permit only limited conclusions. The importance of study design will not be dealt with here, having been covered in our previous article.
Let us instead consider that we have ﬁnished a study, gathered the results and are now looking to analyze the data.
What type of data do we have?
Correct statistical analysis depends on the scale of measurement of the variables, as this determines the distribution of the data and the appropriate statistical tests. Data generated from studies can be broadly split into two scale types: categorical data, which are fundamentally qualitative and do not possess numerical properties, and numerical data, in which quantitative analysis is possible (Figure 2).
Categorical data
This can be of two types:
Nominal: the sole property is that the data can be named, enabling distinction, for example hair colour, blood group or type of hip prosthesis. The only further level of differentiation is that of equivalence. No ranking of this dataset is possible.
Ordinal: although different categories exist, it is possible to rank these data in a specific order. An example is the Likert scale used in questionnaires, whereby patients register their agreement to a statement by choosing one of five categories: strongly agree, agree, neither agree nor disagree, disagree and strongly disagree. Although these responses can be ranked in order of increasing agreement, the relationships between different ranks cannot be precisely defined. Although such scales generate numerical data, we should not treat them as quantitative data by, for example, comparing the sums of scores on such questionnaires.
Numerical data
These data are quantitative and two scales of measurement are possible.
Discrete: these data consist of counts or frequencies. The variable can only assume a finite number of possibilities where in-between values do not exist, e.g. number of operations.
Continuous: measurements can take any value within a speciﬁed range. Examples include the SI units of mass or distance. A further subclassiﬁcation is possible:
Figure 1 Statistical analysis in the research sequence.
464
2010 Elsevier Ltd. All rights reserved.
RESEARCH
Figure 2 Scale of measurement of different variables.
Interval: the difference between measurements (the interval) has meaning but the ratio does not. Examples include relative measures, e.g. measuring walking distance beyond 1 km. We can say that a patient who scores 300 m (actually 1300 m) has walked 150 m more than a patient who scores 150 m (actually 1150 m), but we cannot say that the first patient has walked twice the distance of the second.
Ratio: these data have the added benefit that the ratio between values, as well as the interval, carries meaning. Taking the above example, measuring the actual rather than the relative walking distance gives us ratio data. This scale is only possible if the value of zero has a true meaning.
As we progress down this list of scale types from nominal to continuous, a greater extent of information is engendered by the data. We can simplify data from continuous to ordinal by grouping values together, but such compression leads to loss of information; transforming the distribution of the data limits the statistical tests that we can use.
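This compression of a continuous measurement into ordinal categories can be sketched in a few lines of Python. The cut-off values below are invented purely for illustration:

```python
import bisect

# Hypothetical cut-offs (metres) turning a continuous walking
# distance into ordinal categories; chosen only for illustration.
CUTOFFS = [100, 500, 1000]
LABELS = ["<100 m", "100-499 m", "500-999 m", ">=1000 m"]

def to_ordinal(distance_m):
    """Compress a continuous measurement into an ordinal category."""
    return LABELS[bisect.bisect_right(CUTOFFS, distance_m)]
```

Two patients who walked 120 m and 480 m now fall into the same category: the grouping is convenient, but the 360 m difference between them can no longer be recovered, which is precisely the loss of information described above.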
How are my data distributed?
Having determined our scale of measurement, we can now determine the distribution of the variable through graphical assessment. Plotting the distribution allows us to make judgements regarding the most typical value of our variable and the extent of spread or dispersion around this. If our dataset consists of categorical data, we can graphically compare counts or frequencies between the different groups on a bar chart or pie chart. From this we can determine the most commonly occurring score and the relative ratios between different scores. However, if our data have a continuous scale of measurement, we can construct a histogram (Figure 3). Histograms are not bar charts. They should be considered as frequency distribution charts with distinct mathematical properties. The width of each bar corresponds to the actual limits of that class interval, i.e. the range of values for which the cumulative frequency has been determined. Smaller class intervals better define the shape of the distribution and give an idea as to what may be happening to our variable between class intervals. This is why it is best not to compress data by widening the class interval, as this distorts the distribution. If we connect the midpoints of each bar, a best fit curve can be mapped to these points, known as a frequency polygon, enabling interpolation of values between the class intervals (Figure 3).
The shape of the frequency polygon curve is very important. Curves are mathematical functions, and statistical tests can be derived from these functions to infer the properties of a variable fitting a particular distribution. A statistical test is only valid if the dataset satisfies the assumptions of the appropriate distribution. Rather than determining the actual mathematical function of a distribution curve, a far easier approach is to look at its shape. Most curves form an approximate bell shaped distribution, with a peak flanked by two variable tails which taper off to the outliers. There are three important aspects of the shape of the curve:
1. Modality: a curve is unimodal if it has one peak, i.e. one mode. A bimodal curve has two peaks, and so on.
2. Skewness: this relates to the symmetry of the tails. A curve is positively skewed if most of the scores are clustered at the lower end of the spectrum, making the upper tail longer. A negatively skewed distribution is the converse of this, with most of the scores being high and a longer lower tail (Figure 4).
3. Kurtosis: this describes how peaked or flat the curve is; in essence, how the data are distributed about the peak.
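These aspects of shape can be quantified directly from the data. A minimal Python sketch, computing skewness and excess kurtosis from the central moments (population formulae, with no small-sample correction):

```python
from statistics import mean

def shape(data):
    """Skewness and excess kurtosis from central moments.

    Positive skewness indicates a longer upper tail; positive
    excess kurtosis indicates a more sharply peaked curve than
    the normal distribution (which scores zero on both).
    """
    m, n = mean(data), len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3
```

A perfectly symmetrical dataset scores zero skewness, whereas a dataset with a few very large values scores positive skewness.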
How do we describe our dataset?
Let us now consider the frequency distribution as a visual depiction of the variable of interest in a population rather than within a dataset. The population does not have to consist of individuals: it is simply a set of occurrences of that variable. If we could measure each and every occurrence of a particular variable, we would have a frequency distribution of the population. In most cases, we are unable to obtain measurements for the entire population and restrict ourselves to a representative sample. Through random sampling, this sample should embody all of the characteristics of the population of interest. This is an important point which will be explored later, when we come to consider inferring statistical findings from a sample to the population from which it was drawn.
Figure 3 Histogram with superimposed "best fit" frequency polygon.
When we analyze the population distribution of the variable of interest, we hope to discern the true value of that variable, which lies somewhere within the distribution. As the frequency distribution curve equates to a probability distribution of that variable, we can say that the true value probably lies somewhere within the central peak of a unimodal distribution. If our curve has quite flat kurtosis, then we can be less sure of this, and the question arises as to how far away from this peak the true value may lie. This neatly brings us to the two descriptive concepts used to understand where this true value may lie: measures of central tendency, which aim to determine the most typical value, and measures of variability, which describe dispersion of the variable across the population.
Measures of central tendency
The appropriate selection of this measure depends on both the scale of measurement and the sample size.
Figure 4 Frequency polygons demonstrating (a) a positively skewed distribution, (b) a negatively skewed distribution and (c) a distribution with symmetrical variance.

Mean: this is the arithmetic average and equates to the sum of all observations divided by the number of observations. Although means can be generated for ordinal variables, they are more appropriate when describing continuous variables.
Median: this is the value that comes halfway when the data are ranked in order. If there is an even number of observations, the median is taken as the average of the central two scores. This is a more appropriate measure for ordinal data, where ranks exist but we cannot be sure of the relationship between different ranks. Medians are also useful when we are measuring a continuous variable but our distribution is skewed: the outliers in the tail of skewed data exert a greater effect on the mean than on the median.
Mode: this is the most frequently occurring observation and is usually reserved for nominal data, where other measures of central tendency are not appropriate.
Analogous to the various scales of measurement, as we move down this list from mean to mode, a decreasing amount of consideration and calculation is required to arrive at the measure of central tendency. Consequently, statistical tests based on means rather than medians generally carry greater confidence, whereas few statistical applications utilize the mode.
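The contrasting behaviour of these three measures is easy to demonstrate with Python's standard library. The figures below are invented inpatient stays, with one deliberate outlier:

```python
from statistics import mean, median, mode

# Hypothetical inpatient stays (days); the 30-day stay is an
# outlier that skews the distribution positively.
stays = [3, 3, 4, 4, 4, 5, 5, 6, 30]

print(mean(stays))    # pulled towards the outlier (about 7.1 days)
print(median(stays))  # 4, robust to the extreme value
print(mode(stays))    # 4, the most frequent observation
```

The single extreme value drags the mean well above every typical observation, while the median and mode are unaffected, which is why the median is preferred for skewed data.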
Measures of variability
Let us now direct our attention to quantifying the variability or dispersion of scores within our population.
Range: this is simply the interval from the lowest to the highest value and may be an unsatisfactory descriptor in the presence of extreme values or "outliers."
Centiles: these values encompass most but not all of the data in an attempt to negate the effects of extreme outliers and are thus more suitable for skewed data. A centile is any value below which a given percentage of the values occur, e.g. the 90th centile encloses the first 90% of values. The median lies at the 50th centile. Intercentile ranges are usually used when the median is the most appropriate measure of central tendency, e.g. the interquartile range extends from the 25th to the 75th centile. These are usually depicted graphically as box and whisker plots (Figure 5).
Standard deviation: this is a quantitative representation of how the various scores in the population are dispersed relative to the mean and is used for continuous data. It is a function of the variance, which equates to the arithmetic mean of the squared differences of each score from the population mean. Taking the square root of the variance, a necessary step to return to a measure in the same units as our population scores, gives the standard deviation.
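The relationship between variance and standard deviation can be verified in a few lines, checked here against Python's statistics module; the scores are arbitrary illustrative values:

```python
from statistics import pstdev, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]
m = sum(scores) / len(scores)  # population mean = 5.0

# Variance: the mean of the squared deviations from the mean.
variance = sum((x - m) ** 2 for x in scores) / len(scores)
# Square-rooting returns the measure to the original units.
sd = variance ** 0.5

print(variance, sd)  # 4.0 2.0
```

The library functions pvariance and pstdev give the same answers, confirming the hand calculation.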
Are our data normally distributed?
Gauss, a famous mathematician born in 1777, discovered that most continuous variables, when depicted on a frequency polygon, assume a particular distribution that has come to be known as the Gaussian or normal distribution ^{5} (Figure 6). There are certain key properties of this distribution which define its shape:
Figure 5 Box and whisker plot for a dataset.
1. Unimodality: the distribution has a solitary peak equating to the mode.
2. Symmetrical variance: the distribution is symmetrical about the peak, in essence equating the mode to the mean and the median.
3. Defined kurtosis: 68% of observations lie within 1 standard deviation of the mean and 95% of observations lie within 2 standard deviations of the mean.
The distribution functions as a predetermined probability density function, allowing us to use statistical tests derived from this function. These are known as parametric tests. Parameters are characteristics used to describe population distributions. Parametric data imply that the data are normally distributed. These tests rely on fairly firm assumptions regarding these parameters and are usually based on the sample means. Consequently, we can be fairly confident in the robustness of the findings they generate. Many naturally occurring continuous variables, when plotted as a population distribution, assume these parameters. Statistical tests, e.g. the Shapiro-Wilk test, can be undertaken to define the extent of normality of a dataset, but this can be done more easily by plotting a histogram or a normal plot to assess the shape of the distribution.
Often we are not looking at the variable in the entire population but rather a sample of it. Sampling error relates to the discrepancy between the sample characteristics and the population characteristics, and if this is large the distribution of our sample data may not be normal. Such sample data can be described as nonparametric. The analysis of nonparametric data can be undertaken in three ways:
Figure 6 The properties of the normal distribution.
1. Nonparametric statistical analysis: generally the results of such tests contain a greater degree of uncertainty than their parametric equivalents. This is due to the lack of assumptions regarding the distribution; such tests are based on ranks and medians rather than continuous data and means.
2. Parametric statistical analysis: if we know that the population from which the sample is drawn is normally distributed, we can use parametric tests on the sample data even though the sample may not be normally distributed. However, often we may not be certain, or able to prove, that the population data are normally distributed.
3. Transformation into normalized data: linear transformations such as multiplication or subtraction may be insufficient to normalize a dataset, necessitating nonlinear methods, for example logarithms, with the drawback of increasing the complexity of interpreting the results.
Other types of distribution exist, which act as the basis for statistical tests for nonnormal data. ^{6} The binomial distribution is based on the relative frequencies of all possible permutations and combinations of discrete data. The Poisson distribution is usually applied to discrete quantitative data, such as counts or incidences occurring over a period of time, for example the number of patients undergoing hip fracture surgery per day in a particular hospital. The t distribution is a theoretical distribution derived from the normal distribution with an additional parameter, the degrees of freedom, which determines how long the tails of the distribution are. An increase in the sample size increases the degrees of freedom, thereby shortening the tails and bringing the distribution closer to normal. Statistical tests for small samples have been derived from this distribution.
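The effect of a logarithmic transformation on positively skewed data can be sketched as follows. The data are simulated, and the inline skewness measure is the simple moment-based one:

```python
import math
import random
from statistics import mean

random.seed(42)
# Simulated right-skewed variable (lognormal), e.g. a length of stay.
raw = [random.lognormvariate(0, 1) for _ in range(1000)]

def skewness(data):
    m, n = mean(data), len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

logged = [math.log(x) for x in raw]  # nonlinear transformation

# The raw data are strongly positively skewed; the logged data are
# close to symmetrical, so parametric tests become defensible.
print(skewness(raw), skewness(logged))
```

The price, as noted above, is that any means and differences are now on the log scale and must be back-transformed for interpretation.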
Statistical tests for parametric and nonparametric data
So far we have looked at descriptive statistics, but in order to generate conclusions through inferential statistics we need to be able to test one or more hypotheses. A hypothesis is a testable statement generated by the research question. For example, we may be interested in the degree of association between two continuous variables, the difference between outcomes of two groups or the level of agreement between two observers for one variable. The choice of appropriate test relies not just on the research question we are addressing but also on the previously mentioned data properties of scale of measurement and distribution.
Comparing two independent groups
Many studies involve comparing two groups, either different interventions or exposed versus nonexposed. The type of data and the distribution determine the appropriate statistical test (Figure 7).
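As one concrete example of a nonparametric two-group test, the Mann-Whitney U statistic for two small samples can be computed by brute force over all cross-group pairs. This is a teaching sketch; statistical packages use an equivalent rank-based formula and also supply the p value:

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for two independent samples.

    For every cross-group pair, count how often a value in group a
    exceeds a value in group b (ties score 0.5). The smaller of the
    two U values is compared against published critical-value
    tables to judge significance.
    """
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0
              for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Complete separation of the groups gives the most extreme U of 0.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0
```

Because it works on the ordering of the pooled data rather than their actual values, the statistic makes no assumption of normality.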
Comparing paired data
Occasionally two groups are wrongly compared using the above analysis. This is the case when we are looking at paired data, such as repeated measures before and after an intervention. If we have continuous data, we are interested in the mean of the differences between successive readings rather than the difference in the means between the two groups, effectively reducing the data to a one sample problem. We then need to ascertain whether the distribution of the differences is normal, rather than considering normality of the original two samples, when deciding whether to undertake the parametric test or its nonparametric equivalent (Figure 8).
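A short sketch of this reduction, using invented before-and-after measurements for six patients:

```python
from statistics import mean, stdev

# Hypothetical knee flexion (degrees) before and after physiotherapy
# for the same six patients: paired, not independent, data.
before = [70, 85, 90, 75, 80, 95]
after = [78, 88, 95, 80, 84, 99]

# The paired analysis works on per-patient differences, reducing
# the two samples to a single sample of differences.
diffs = [a - b for a, b in zip(after, before)]
print(mean(diffs), stdev(diffs))

# It is the distribution of these differences, not of the original
# two samples, that must be checked for normality when choosing
# between the paired test and its nonparametric equivalent.
```

Pairing removes the between-patient variability, which is why a paired analysis is usually more powerful than treating the same numbers as two independent groups.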
Comparing more than two groups
When we are comparing more than two groups we have two options (Figure 9). Either we can perform multiple tests for independent groups or, preferably, we can use a one way analysis of variance (ANOVA). This is a parametric test which simultaneously compares all of the groups' means on the basis that all of the groups are normally distributed with the same variance. If our data are nonparametric then we need to use a different test.
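The F statistic at the heart of one way ANOVA compares between-group to within-group variability and is simple to compute by hand. This is a sketch: a real analysis would refer F to the F distribution with (k − 1, N − k) degrees of freedom to obtain a p value:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square).

    Assumes each group is approximately normally distributed with
    equal variance. A large F means the group means differ by more
    than within-group scatter alone would suggest.
    """
    grand = mean(x for g in groups for x in g)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```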
Figure 7 Statistical tests for comparing two independent groups (Fisher's exact test, the independent t test and the Mann-Whitney test).

Figure 8 Statistical tests for paired data.

Correlating the results of two groups
There are many instances when we are trying to show an association between two variables. If we are trying to show that two variables correlate then we should look at using a correlation test (Figure 10), which can be useful in determining causality, concurrent validity and internal consistency. However, one must always be aware of spurious correlations with the passage of time, such as the price of butter increasing with the birth rate, due to temporal trends rather than any meaningful relationship between the two. A further point to note is that correlation and association do not equal causality. To establish that A causes B, we have to prove that A always precedes B, that A and B correlate and that if A is absent B cannot occur. There are two tests for correlation, depending on whether our numerical data are parametric or nonparametric. Each yields a value between -1 (perfect negative correlation) and +1 (perfect positive correlation), with zero equating to no correlation. These results indicate the degree of scatter of the data around a best fit line when the two variables are plotted against each other. However, these tests can only be used if we expect a linear correlation. Regression models deepen our knowledge of association by describing and quantifying the relationship between two variables. These models can also be used for nonlinear relationships. However, our analysis will be dictated by which variables we assign as predictor/independent variables and which ones we assign as outcome/dependent variables.
The validation of diagnostic tools requires demonstration of intra- and interobserver agreement, ^{6} in essence establishing that the results of the test are independent of the observer or extrinsic circumstances at the time of measurement. If we are comparing quantitative scales we can calculate the standard deviation of the differences or the coefficient of variation. However, if the data are categorical then we can generate a kappa (κ) statistic, which measures the number of agreements occurring in excess of those expected by chance. A value of 1 indicates perfect agreement, whereas a value of less than 0.2 indicates poor agreement.
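Both a correlation coefficient and a kappa statistic can be computed from first principles. The sketch below implements Pearson's r for continuous data and Cohen's kappa for two raters' categorical judgements, worked on invented data:

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: scatter about a best fit straight line."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def cohens_kappa(rater1, rater2):
    """Agreement between two raters in excess of chance."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in set(rater1) | set(rater2))
    return (p_observed - p_chance) / (1 - p_chance)

print(pearson_r([1, 2, 3], [2, 4, 6]))           # 1.0: perfectly linear
print(cohens_kappa(list("AABB"), list("AABA")))  # 0.5: moderate agreement
```

In the kappa example the raters agree on three of four cases (75%), but half that agreement would be expected by chance alone, leaving a kappa of 0.5.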
Hypothesis testing as the basis of statistical test interpretation
We have learnt how to correctly select a statistical test based on the scale of measurement and distribution of our data, in conjunction with the research question we are trying to address. However, to correctly interpret the results we have to understand the basis for these tests. The statistical tests described above all act to test hypotheses. For example, if we were to compare the results of two groups A and B, the possible hypotheses are:
Hypothesis 1 (H1): group A is better than group B, i.e. a difference in one direction (alternative hypothesis)
Hypothesis 2 (H2): group B is better than group A, i.e. a difference in the opposite direction (alternative hypothesis)
Hypothesis 0 (H0): there is no difference between the groups, i.e. the effect of interest = zero (null hypothesis)
H1 and H2 are examples of alternative hypotheses, which are what studies are usually interested in. However, statistical tests work on the basis of deriving a probability, a 'p' value, for accepting or rejecting the null hypothesis, which is the exact opposite of the alternative hypotheses. The null hypothesis always defines the effect of interest as zero or non-existent. p values are often quoted as the probability that the results obtained were due to chance, which is an incorrect oversimplification. The correct definition is:
"the probability of obtaining the observed difference, or one more extreme, given the null hypothesis is true" ^{7}
Simply, the p value is a probability statement about the likelihood of the statistical observation, or one more extreme, given that the effect of interest is zero. If the p value is high we cannot rule out the null hypothesis, whereas if it is small we reject the null hypothesis and thus favour the alternative hypothesis. The question arises as to what magnitude of p value can be regarded as small. Arbitrarily, the consensus opinion is for a cut off value of 5%, i.e. a p value of less than 0.05 is deemed statistically significant with respect to rejecting the null hypothesis.
However, hypothesis testing, and by extension p values, can produce errors in analysis. For example, if we are comparing two treatments in a clinical trial we may demonstrate a difference between the two groups with a significant p value and thus reject the null hypothesis when it may actually be true (type I error). Conversely, we may accept the null hypothesis that there is no difference between the two when this is actually false (type II error). These scenarios are depicted in Table 1. A type I error causes us to detect a difference that is not actually present and reflects either flaws in the experimental technique or the level at which we have set significance. The converse situation, a type II error, occurs when we have failed to demonstrate a difference that exists. This may also relate to experimental flaws but may also be a result of sampling error. We have already mentioned that often we cannot study the entire population of interest and have to take a representative sample, which if randomly drawn should share all identifiable and unidentifiable variables with the target population. The extent to which it fails to do this is known as the sampling error. We can see that as the sample size increases, the representation must improve and this error must decrease. Therefore, our ability to detect a meaningful difference between two groups is highly dependent on sample size.

Figure 9 Statistical tests for comparing more than two groups (multiple independent t/Mann-Whitney tests with a Bonferroni correction, or analysis of variance).

Figure 10 Tests of correlation of two or more groups (e.g. Spearman's rho).
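The definition of the p value can be made concrete with a permutation test, which builds the null distribution directly by repeatedly shuffling the group labels. This is a simulation sketch, and the group data are invented:

```python
import random

def permutation_p(a, b, n_perm=10_000, seed=1):
    """Two-sided permutation test for a difference in means.

    Approximates the p value from its definition: the proportion of
    label shufflings (i.e. datasets generated under the null
    hypothesis of no group difference) whose mean difference is at
    least as extreme as the one actually observed.
    """
    rng = random.Random(seed)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Identical groups: the observed difference is zero, every shuffle is
# at least as extreme, so p = 1.0.
print(permutation_p([1, 2, 3], [1, 2, 3]))
# Widely separated groups: almost no shuffle reproduces so extreme a
# difference, so p is very small.
print(permutation_p([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```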
Power calculations
We can see that larger studies are more powerful with respect to their ability to detect a meaningful difference and thus reject the null hypothesis. The power of a study is deﬁned as:
"the probability that a study of a given size will register as statistically significant a real difference of a given magnitude." ^{8}
In essence, a power calculation allows us to determine the sample size required to register a specific magnitude of effect. However, to determine this, we first need to establish the smallest true clinical or experimental effect that we consider meaningful; in other words, the magnitude of effect that, if it exists, would allow us to state that there is a difference between the two groups. The other important quantities are the power of the study (1 − β, the probability of detecting this magnitude of effect) and the level of statistical significance (α). These are typically set at 80% and 5% respectively but can be set at any level. The power calculation is based on these quantities and assumes independent groups with roughly equal sample sizes that have normal distributions. A power calculation is likely to produce a very large sample size if the clinical effect, or difference between the two groups we are measuring, is very small or the variance is very large. Conversely, if we reduce the power or increase the level of significance then our sample size may be smaller, but at the cost of increasing the risks of type II and type I errors respectively.
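For comparing two means, the standard normal approximation to the power calculation can be written directly. This is a sketch; real trial designs often add a small correction or use the t distribution:

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means.

    delta: smallest clinically meaningful difference in means.
    sigma: assumed common standard deviation.
    Formula: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Detecting half a standard deviation with 80% power at the 5% level
# needs about 63 patients per group; halving the effect size roughly
# quadruples the requirement.
print(n_per_group(0.5, 1.0))   # 63
print(n_per_group(0.25, 1.0))  # 252
```

The quadratic dependence on delta is why small clinical effects demand very large trials, exactly as described above.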
Problems with p values and hypothesis testing
Hypothesis testing is sound in principle, but restricting analysis to the interpretation of p values alone has significant drawbacks. Let us consider some theoretical examples of how this may occur. A new perioperative regime to optimize the care of lower limb arthroplasty patients is introduced in order to reduce the inpatient stay and associated costs of treatment. A study is carried out looking at the inpatient stay of this population before and after this intervention is introduced. An analysis comparing the before and after groups demonstrates a statistically significant difference, with a p value less than 0.05. The new intervention is heralded as a success and implemented. However, on closer scrutiny it can be seen that the actual difference between the two groups is less than 1 day, which is not clinically significant. The costs of implementing this intervention have actually exceeded the financial benefit in terms of reduced stay. Very small differences can become statistically significant if a large enough sample size is used.

Occasionally multiple tests are carried out on a dataset, either purposefully or in the vain hope of demonstrating a statistically significant relationship. As significance relates to probability, the greater the number of tests undertaken, the more likely you are to come up with a significant p value by chance. Although multiple testing should be avoided, either by appropriately limiting tests to those specified in advance by the research question or through the use of regression models, occasionally it may be unavoidable. In these instances the easiest correction is the Bonferroni correction, whereby each p value is multiplied by the number of tests undertaken to see if it remains significant.
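The Bonferroni adjustment just described is simple enough to sketch in a few lines. This is an illustrative helper (the function name is ours, not from the article): each raw p value is multiplied by the number of tests, capped at 1.0, and then judged against the original significance level.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p value by the number of
    tests performed (capped at 1.0) and report whether the adjusted
    value is still below the chosen significance level."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]

# Five tests: a raw p of 0.03 is no longer significant once adjusted
# (0.03 x 5 = 0.15), while p = 0.004 survives the correction (0.02).
results = bonferroni([0.004, 0.03, 0.2, 0.5, 0.8])
```

The correction is deliberately conservative: with many tests it can discard genuine effects, which is another reason to limit testing to pre-specified questions.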
Another limitation of hypothesis testing can be exemplified by considering a theoretical study comparing a new thromboprophylactic drug with an old one, which demonstrates a fivefold reduction in the incidence of postoperative thromboembolic events. However, the p value generated is 0.15, which is deemed nonsignificant. On this basis, no further studies are undertaken and plans to replace the old drug with the new one are indefinitely shelved. This theoretical example highlights the very real risk that we may dismiss effective interventions on the basis of statistical significance alone. A high p value merely suggests that there is insufficient evidence to reject the null hypothesis. Arbitrarily accepting the null hypothesis because significance was set at a particular level implies that we have found no proof of difference. However, "no proof of difference" does not equate to "proof of no difference", ^{9} a fundamental consideration when interpreting p values. Judging a p value against a fixed significance cut-off effectively reduces the answer to any research question to a yes/no status. It is more informative to look at the p value itself and its probability implications rather than at arbitrary cut-offs. Unfortunately, this relative ease of understanding has led to publication bias towards significant results. ^{1} The analysis of data should focus on characterizing the actual study effect under consideration rather than on probability statements alone.
Estimation and conﬁdence intervals
Hypothesis testing asks the question "is it, or isn't it?". What we are actually interested in are the answers to "how big is the difference?" and "in what direction is the difference?". As our results encompass uncertainty, we have to rely on methods to estimate where the true value of interest lies. As we have seen earlier, we can make point estimates based on measures of central tendency, such as the arithmetic mean. However, these do not take into account the variability or dispersion of the data relative to this value. Of greater interest is the estimation of an interval that we can be confident encloses the unknown true population value. These are known as confidence intervals and encompass an estimate of the true value (for example the arithmetic mean) as well as the sampling variability, with some level of assurance or confidence. A confidence interval at any level can be constructed, but by convention 95% confidence intervals are usually derived. It is important to note that confidence intervals are not direct probability statements. To state that a 95% confidence interval means that there is a 95% probability that the true population value of interest lies within this range is false. What it actually means is
that if we took 100 similar sized samples from the population and derived a 95% confidence interval for each, then 95 of these intervals would contain the true population value, i.e. 95% of similarly constructed 95% confidence intervals will contain the true value. ^{6} The equation for deriving a 95% confidence interval is: ^{7}
95% CI = x̄ ± 1.96 (s/√n)
where x̄ = sample mean, s = standard deviation, n = sample size. We can see from this formula that if our standard deviation (i.e. the variance) is large and/or our sample size is small, then our confidence interval, and thus our magnitude of uncertainty, is also large. The width of confidence intervals decreases with increasing sample size, but it is always advantageous to compare intervals, no matter how wide, rather than point estimates. An example of comparing confidence intervals to means is as follows: consider a theoretical study where the mean risk of developing a complication with Operation A compared to Operation B is fourfold, with a confidence interval of 0.6 to 9.7. The means alone suggest that the risk is higher with Operation A. However, looking at the confidence interval we can see that it encloses 1, i.e. equivalence. Therefore, if this is the true difference, then we can state that there is equivalent risk with both operations. There is also the possibility that there is less risk with Operation A, because the interval encloses 0.6. This finding is tempered by the fact that the confidence interval extends in the opposite direction to a greater than ninefold risk with Operation A. The reader is thus endowed with far greater knowledge regarding the difference between the two groups than that provided by a p value alone. For this reason, statistical analysis should always look at confidence intervals so as to show the direction and magnitude of effect. Only then can we determine whether a statistically significant p value actually has clinical significance. However, we must always remember that confidence intervals, like p values, are not immune to errors in study design or bias, and that the interval should always be regarded as the smallest estimate of the real error in defining the true population value.
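The formula above, and the repeated-sampling interpretation of a 95% interval, can both be demonstrated with a short sketch. This uses the normal-based multiplier 1.96 exactly as in the text; for small samples a t multiplier would be more accurate, and the function name is ours.

```python
import math
import random
from statistics import mean, stdev

def ci95(sample):
    """Normal-approximation 95% confidence interval for a sample mean,
    following the formula in the text: mean ± 1.96 (s/√n)."""
    x_bar = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
    return x_bar - 1.96 * se, x_bar + 1.96 * se

# Coverage check: roughly 95% of similarly constructed intervals
# should contain the true population mean (here 0).
random.seed(1)
hits = 0
for _ in range(1000):
    lo, hi = ci95([random.gauss(0, 1) for _ in range(50)])
    hits += lo <= 0 <= hi
coverage = hits / 1000  # close to, but not exactly, 0.95
```

Any single interval either contains the true value or it does not; the 95% refers to the long-run behaviour of the procedure, which is why the direct probability reading is wrong.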
Diagnostic tests
When we are looking at diagnostic tests for conditions, we need to be aware of certain statistical deﬁnitions. Usually we are interested in comparing the results of a new diagnostic test with an established reference test or standard for diagnosing the condition. If we plot a two by two table for all possible
Different outcomes for diagnostic test compared to reference standard

                         Disease/reference standard
Test result      Present                Absent                  Total
Positive         a (true positives)     b (false positives)     a + b
Negative         c (false negatives)    d (true negatives)      c + d
Total            a + c                  b + d                   n

Table 2
Common mistakes in statistical analysis and interpretation

Planning/study design

Data analysis

Results interpretation

Table 3
permutations of results of these two tests we can deﬁne several measures of the efﬁcacy of our diagnostic test as shown ^{6} (Table 2).
Sensitivity = proportion of patients with the condition who are correctly identified by the test = a/(a + c). A highly sensitive test can be used to rule a condition out.

Specificity = proportion of patients without the condition who are correctly identified by the test = d/(b + d).

Positive predictive value = proportion of patients with a positive test result who are correctly diagnosed = a/(a + b).

Negative predictive value = proportion of patients with a negative test result who are correctly diagnosed = d/(c + d).
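The four measures above follow mechanically from the cells of Table 2, as this short sketch shows (the function name is illustrative; exact fractions are used to avoid rounding).

```python
from fractions import Fraction

def diagnostic_measures(a, b, c, d):
    """Efficacy measures for a diagnostic test from the 2 x 2 table
    in Table 2: a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "sensitivity": Fraction(a, a + c),  # high value helps rule a condition out
        "specificity": Fraction(d, b + d),
        "ppv": Fraction(a, a + b),          # positive predictive value
        "npv": Fraction(d, c + d),          # negative predictive value
    }

# Example: 80 true positives, 30 false positives,
# 20 false negatives, 70 true negatives.
m = diagnostic_measures(80, 30, 20, 70)
```

Note that sensitivity and specificity are properties of the test, whereas the predictive values also depend on how common the condition is in the population studied.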
Conclusion
In summary, we can see that statistical tests can only be meaningfully and appropriately applied if we understand the properties of our dataset and base our analysis on a suitable research question. However, the statistical tests themselves are only half the battle; the remainder is how to interpret the results to generate credible findings. Table 3 summarizes the common errors in both analysis and interpretation, to conclude this discourse and serve as a reminder of the main points to consider before embarking on statistical analysis.
REFERENCES
1 Hopewell S, Loudon K, Clarke MJ, Oxman AD, Dickersin K. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database Syst Rev 2009; Issue 1.
2 Qureshi AA, Ibrahim T. Study design in clinical orthopaedic trials. Orthop and Trauma 2010; 24: 229–40.
3 Ballentine LE. The statistical interpretation of quantum mechanics. Rev Mod Phys 1970; 42: 358–81.
4 Kirkwood BR, Sterne JAC. Essential medical statistics. 2nd edn. Blackwell, 2003.
5 Altman DG, Bland JM. The normal distribution. BMJ 1995; 310: 298.
6 Bland M. An introduction to medical statistics. 3rd edn. Oxford University Press, 2000.
7 Campbell MJ, Machin D. Medical statistics: a commonsense approach. 3rd edn. Wiley, 1999.
8 Altman DG. Statistics and ethics in medical research. III How large a sample? BMJ 1980; 281: 1336–8.
9 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.