

Statistical tests in orthopaedic research

A A Qureshi, T Ibrahim


An understanding of the basic principles of statistical analysis is vital before commencing research. This article aims to provide a concise overview of this extensive subject, highlighting the important concepts. Statistical analysis should be considered at the planning stage of any study so as to establish hypotheses, specify the primary outcome of interest and undertake a sample power calculation. The research question, scale of measurement and distribution of the outcome variable all have a bearing on the appropriate choice of statistical test. A statistical test can only be employed if the distribution assumptions of the test have been met. The interpretation of significance must be tempered by limitations of the method of analysis, as well as recognizing the variability of the effect of interest using an interval estimate. The various descriptive statistics in diagnostic studies are also explored.

Keywords confidence intervals; p values; power calculation; statistics

Introduction – the importance of statistics in clinical practice

A fundamental understanding of statistical analysis is a necessary pre-requisite to undertaking clinical research. Despite this, many otherwise well designed studies are let down by poor analysis and incorrect application of tests as an unfortunate consequence of insufficient knowledge or attention being devoted to this vital part of the research. To a certain extent this reflects deficiencies amongst clinicians in understanding and implementing statistical methods. Although importance is attached to study design, critical analysis of research based on the appropriate use of statistics is often suboptimal. All too frequently an assigned p value assumes overwhelming importance in the results of a study and has been demonstrated to be a source of publication bias.1 This may have significant healthcare implications if ineffective, costly treatments are adopted whilst beneficial interventions are marginalized despite evidence that is not capable of standing up to scientific scrutiny. The demands generated by the ever increasing development of medical technologies cannot be met by finite healthcare resources. The appropriate utilization of resources alongside

A A Qureshi MB BS MSc MRCS Specialist trainee in Orthopaedic Surgery, University Hospitals of Leicester, Leicester Royal Infirmary, Leicester LE1 5WW, UK.

T Ibrahim MB BS(Hons.) MD FRCS(Tr & Orth) Clinical Lecturer in Orthopaedic Surgery, University of Leicester, Leicester General Hospital, Leicester LE5 4PW, UK.

improved patient care are the anticipated sequelae of evidence based practice. Statistical science will always remain an essential step in the progression of this paradigm. The purpose of this article is to highlight the basic principles of statistical analysis in orthopaedic research without reference to complex mathematical theorems or scientific proofs. The appropriate use of statistics is intimately related to the major considerations in study design and ultimately is driven by the research question. Thus, it is crucial that statistical analysis is considered at an early stage in the inception of a study as this can help to avoid several potential pitfalls later on. Although preliminary discussion with a statistician is beneficial to a study, it is no substitute for a basic grounding in statistical methods amongst the trial developers. Important concepts and considerations relating to the design of studies have been covered in our previous article.2 The intention of this article is to deliver a concise overview of how statistics can be appropriately utilized to generate robust findings from a study. This understanding is twinned with the acquisition of skills enabling critical analysis of the interpretation and presentation of results in scientific papers, ultimately endowing the reader with the insight to question the legitimacy of the conclusions drawn from any study they read.

Why are statistics necessary?

Scientific reasoning has traditionally involved considering entities as discrete and absolute; where measurements are unwavering despite endless repetition or altered circumstances. However, even on the smallest conceivable scale of observation this perspective has shifted. The birth of quantum mechanics in the early 20th century arose from the realization that phenomena involving electrons could not be explained in terms of classical mechanics.3 Heisenberg's uncertainty principle proposed that an electron's spatial location was best understood as existing within a cloud of probability where a precise location was "uncertain." This uncertainty arises from the understanding that there are countless factors exerting varying magnitudes of influence on the behaviour of this smallest of species. As the complexity of the phenomenon of interest increases, the number of governing factors and the accompanying uncertainty must doubtlessly increase. Thus it can be seen that when biological systems are

subjected to scientific observation, the one intended true measure of a variable is rarely observed due to variation in the phenomenon of interest as a result of the complex interplay of competing influences. In essence, this is why we call these measured properties variables – because they vary, and this variation is often described as random. Usually we are investigating the effect of altering a variable, known as the independent variable, on the behaviour of an outcome or dependent variable. With the understanding that all variables are subject to variation, the extent of which is related to the number of determining factors, we can elevate our thinking to consider the following points:

Can we quantify the observed variation for a particular dependent variable? e.g. what is the 10-year survivorship of a specific total hip replacement? Can we determine which independent variables are important in determining the extent of this variation?


© 2010 Elsevier Ltd. All rights reserved.


e.g. age of patient, BMI, postoperative infection, length of inpatient stay – which of these variables affect survivorship? Can we determine the direction of effect when we change one of the independent variables? e.g. is an increased body mass index associated with enhanced survivorship or early failure? Can we determine the magnitude of effect when we change one of the independent variables? What is the difference in survivorship in number of years between a patient aged 60 years and a patient aged 70 years? Can we formulate a predictive model to explain the variation of the dependent variable? By knowing the patient age, BMI, and the nature of any postoperative complications can we predict the length of survivorship of the implant? Statistical science attempts to address such questions. The most comprehensive definition of statistics is given by Kirkwood4 as "the science of collecting, summarizing, presenting, and interpreting data, and the using of them to test hypotheses." It is logical to surmise that the quality of statistical analysis is directly related to the quality of the data and by extension of the study itself. Effective study design is the crucial foundation of research. Well formulated statistical analysis can be rendered redundant by a poorly executed study, whereas a well designed study marred by poor statistical analysis can still be redeemed by repeat analysis. The research question itself drives the study design and in turn the analysis. Deficiencies in the planning stage are one of the main reasons for poor studies. These elements may be considered as sequential mechanisms within the research engine (Figure 1). The engine will not start if one of the early gears has failed even if the subsequent ones have been furnished to perfection. Conversely, poor statistical analysis in a well designed study will permit limited conclusions. The importance of study design will not be dealt with here, having been covered in our previous article.
Let us instead consider that we have finished a study, gathered the results and are now looking to analyze the data.

What type of data do we have?

Correct statistical analysis depends on the scale of measurement of the variables, as this determines the distribution of the data and the appropriate statistical tests. Data generated from studies can be broadly split into two scale types – categorical, which is fundamentally qualitative and does not possess numerical properties, and numerical, in which quantitative analysis is possible (Figure 2).

Categorical data

This can be of two types:

Nominal: the sole property is that the data can be named, enabling distinction. For example – hair colour, blood group, type of hip prosthesis. The only further level of differentiation is that of equivalence. No ranking of this dataset is possible.

Ordinal: although different categories exist, it is possible to rank these data in a specific order. An example is the Likert scale used in questionnaires whereby patients register their agreement to a statement by choosing one of five categories – strongly agree, agree, neither agree nor disagree, disagree and strongly disagree. Although these responses can be ranked in an order of increasing agreement, the relationships between different ranks cannot be precisely defined. Although such scales generate numerical data we should not treat this as quantitative data, e.g. by comparing the sums of scores on such questionnaires.

Numerical data

These data are quantitative and two scales of measurement are possible.

Discrete: these data consist of counts or frequencies. The variable can only assume a finite number of possibilities where in-between values do not exist, e.g. number of operations.

Continuous: measurements can take any value within a specified range. Examples include the SI units of mass or distance. A further subclassification is possible:

Planning → Study design → Execution → Data gathering → Data analysis (statistics) → Data interpretation (statistics) → Presentation → Publication

Figure 1 Statistical analysis in the research sequence.




Variables → Categorical (Nominal, Ordinal) and Numerical (Discrete, Continuous: Interval, Ratio)

Figure 2 Scale of measurement of different variables.

Interval – the difference between measurements (the interval) has meaning but the ratio does not. Examples include relative measures, e.g. measuring walking distance after 1 km. We can say that a patient who scores 300 m (actually 1300 m) has walked 150 m more than a patient who has walked 150 m (actually 1150 m), but we cannot say that the first patient has walked twice the distance of the second.

Ratio – these data have the added benefit that the ratio between values as well as the interval carries meaning. Taking the above example, measuring the actual rather than the relative walking distance gives us ratio data. This scale is only possible if the value of zero has a true meaning.

As we progress down this list of scale types from nominal to continuous, a greater extent of information is engendered by the data. We can simplify data from continuous to ordinal by grouping values together but such compression leads to loss of information; transforming the distribution of the data limits the statistical tests that we can use.

How are my data distributed?

Having determined our scale of measurement, we can now determine the distribution of the variable through graphical assessment. Plotting the distribution allows us to make judgements regarding the most typical value of our variable and the extent of spread or dispersion around this. If our dataset consists of categorical data, we can graphically compare counts or frequencies between the different groups on a bar chart or pie chart. From this we can determine the most commonly occurring score and the relative ratios between different scores. However, if our data have a continuous scale of measurement, we can construct a histogram (Figure 3). Histograms are not bar charts. They should be considered as frequency distribution charts with distinct mathematical properties. The width of the bar corresponds to the actual limits of that class interval i.e. the range of values for which the cumulative frequency has been determined. Smaller class intervals better define the shape of the distribution and give an idea as to what may be happening to our variable between class intervals. This is why it is best not to compress data through widening the class interval as this distorts the distribution. If we connect the midpoints of each bar, a best fit curve can be mapped to these points, known as a frequency polygon, enabling interpolation of values between the class intervals (Figure 3). The shape of the frequency polygon curve is very important. Curves are mathematical functions and statistical tests can be derived from these functions to infer the properties of a variable fitting a particular distribution. The statistical test is only valid if the dataset satisfies the assumptions of the appropriate distribution. Rather than determining the actual mathematical function of a distribution curve, a far easier approach is to look at its shape. Most curves form an approximate bell shaped distribution, with a peak flanked by two variable tails which taper off to the outliers. There are three important aspects of the shape of the curve:

  • 1. Modality – a curve is unimodal if it has one peak i.e. one mode. A bimodal curve has two peaks and so on.

  • 2. Skewness – this relates to the symmetry of the tails. A curve is positively skewed if most of the scores are clustered at the lower end of the spectrum, making the upper tail longer. A negatively skewed distribution is the converse of this, with most of the scores being high and a longer lower tail (Figure 4).

  • 3. Kurtosis – this describes how flat or peaked the curve is; in essence, how the data are distributed about the peak.
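Both measures can be computed directly from the raw scores. As a stdlib-only sketch on hypothetical data, moment-based skewness is the mean cubed deviation divided by the cube of the standard deviation, so a long upper tail yields a positive value:

```python
import statistics

def skewness(xs):
    """Moment-based skewness: mean cubed deviation / (population SD)^3."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)

# Scores clustered at the low end with a long upper tail: positive skew
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
# The mirror image: negative skew
left_tailed = [-10, 1, 2, 2, 3, 3, 3, 4]

print(skewness(right_tailed) > 0, skewness(left_tailed) < 0)  # True True
```

An analogous fourth-moment calculation gives kurtosis; statistical packages report both routinely.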

How do we describe our dataset?

Let us now consider the frequency distribution as a visual depiction of the variable of interest in a population rather than within a dataset. The population does not have to consist of individuals – it is simply a set of occurrences of that variable. If we can measure each and every occurrence of a particular variable we will have a frequency distribution of the population. In most cases, we are unable to obtain measurements for the entire population and restrict ourselves to a representative sample. Through random sampling, this sample should embody all of the characteristics of the population of interest. This is an important point which will be explored later when we come to consider inferring statistical findings from a sample to the population from which it was drawn.







Figure 3 Histogram with superimposed "best fit" frequency polygon.

When we analyze the population distribution of the variable of interest, we hope to discern the true value of that variable, which lies somewhere within the distribution. As the frequency distribution curve equates to a probability distribution of that variable, we can say that the true value probably lies somewhere within the central peak of a unimodal distribution. If our curve has quite flat kurtosis, then we can be less sure of this, and the question arises as to how far away from this peak the true value may lie. This neatly brings us to the two descriptive concepts used to understand where this true value may lie – measures of central tendency, which hope to determine the most typical value, and measures of variability, which describe dispersion of the variable across the population.

Measures of central tendency

The appropriate selection of this measure depends on both the scale of measurement and the sample size.

Mean: this is the arithmetic average and equates to the sum of all observations divided by the number of observations. Although


Figure 4 Frequency polygons demonstrating (a) a positively skewed distribution, (b) a negatively skewed distribution and (c) a distribution with symmetrical variance.




means can be generated for ordinal variables, they are more appropriate when describing continuous variables.

Median: this is the value that lies halfway when the data are ranked in order. If there is an even number of observations, the median is taken as the average of the central two scores. This is a more appropriate measure for ordinal data where ranks exist but we cannot be sure of the relationship between different ranks. Medians are also useful when we are measuring a continuous variable but our distribution is skewed. The outliers in the tail of skewed data exert a greater effect on the mean than on the median.

Mode: this is the most frequently occurring observation and is usually reserved for nominal data where other measures of central tendency are not appropriate. Analogous to the various scales of measurement, as we move down this list from mean to mode a decreasing amount of consideration and calculation is required to arrive at the measure of central tendency. Consequently, statistical tests based on means rather than medians carry greater confidence whereas few statistical applications utilize the mode.
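The contrast between these measures is easy to demonstrate with Python's standard library; the lengths of stay below are hypothetical, with a single outlier:

```python
import statistics

# Hypothetical inpatient lengths of stay in days; 60 is an outlier
lengths_of_stay = [3, 4, 4, 5, 6, 7, 60]

mean = statistics.fmean(lengths_of_stay)
median = statistics.median(lengths_of_stay)
mode = statistics.mode(lengths_of_stay)

print(round(mean, 1), median, mode)  # 12.7 5 4 — the outlier drags only the mean upwards
```

The median (5 days) describes the typical patient far better here than the mean (12.7 days), which is why medians are preferred for skewed data.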

Measures of variability

Let us now direct our attention to quantifying the variability or dispersion of scores within our population.

Range: this is simply the lowest and highest value and may be an unsatisfactory descriptor in the presence of extreme values or “outliers.”

Centiles: these values encompass most but not all of the data in an attempt to negate the effects of extreme outliers and are thus more suitable for skewed data. A centile is any value below which a given percentage of the values occur, e.g. the 90th centile encloses the first 90% of values. The median lies at the 50th centile. Intercentile ranges are usually used when the median is the most appropriate measure of central tendency, e.g. the interquartile range extends from the 25th to the 75th centiles. These are usually graphically depicted as box and whisker plots (Figure 5).
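In Python's standard library, `statistics.quantiles` returns these cut points directly; a sketch on hypothetical skewed data:

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]  # positively skewed by the outlier 30

# n=4 gives the three quartile cut points; "inclusive" treats data as the whole population
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3, q3 - q1)  # 4.5 7.0 9.5 5.0 — q2 is the median, q3 - q1 the interquartile range
```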

Standard deviation: this is a representation of how the various scores in the population are dispersed quantitatively relative to the mean and is used for continuous data. It is a function of the variance, which equates to the arithmetic mean of the squares of the differences of each score from the population mean. Taking the square root of the variance, a necessary pre-requisite to obtain a measure in the same units as our population scores, gives us the standard deviation.
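As a worked sketch of this definition (hypothetical scores, stdlib only):

```python
import math
import statistics

scores = [4, 8, 6, 5, 3, 7, 9]  # a small hypothetical population of scores

mean = statistics.fmean(scores)
# Variance: arithmetic mean of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in scores) / len(scores)
# Square root restores the original units of measurement
sd = math.sqrt(variance)

print(variance, sd)  # 4.0 2.0
```

The same values come from `statistics.pvariance` and `statistics.pstdev`; the sample (n − 1) versions are `variance` and `stdev`.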

Are our data normally distributed?

Gauss, a famous mathematician born in 1777, discovered that most continuous variables when depicted on a frequency polygon assume a particular distribution that has come to be known as the Gaussian or normal distribution5 (Figure 6). There are certain key properties of this distribution which define its shape:

ORTHOPAEDICS AND TRAUMA 24:6 467 © 2010 Elsevier Ltd. All rights reserved.

Figure 5 Box and whisker plot for a dataset.

1. Unimodality – the distribution has a solitary peak equating to the mode.
2. Symmetrical variance – the distribution is symmetrical about the peak, in essence equating the mode to the mean and the median.
3. Defined kurtosis – 68% of observations lie within 1 standard deviation of the mean and 95% of observations lie within 2 standard deviations of the mean.
The distribution functions as a predetermined probability density function allowing us to use statistical tests derived from this function. These are known as parametric tests. Parameters are characteristics used to describe population distributions. Parametric data imply that the data are normally distributed. These tests rely on fairly firm assumptions regarding these parameters and are usually based on the sample means. Consequently, we can be fairly confident in the robustness of the findings they generate. Many naturally occurring continuous variables, when plotted as a population distribution, approximately assume these parameters. Statistical tests, e.g. the Shapiro–Wilk test, can be undertaken to define the extent of normality of a dataset, but this can be done more easily by plotting the histogram or a normal plot to assess the shape of the distribution. Often we are not looking at the variable in the entire population but rather a sample of it. Sampling error relates to the discrepancy between the sample characteristics and the population characteristics and if this is large the distribution of our sample data may not be normal. These sample data can be described as non-parametric. The analysis of non-parametric data can be undertaken in three ways:
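The 68%/95% property can be checked by simulation; a stdlib sketch drawing from a normal distribution with a hypothetical mean of 50 and SD of 10:

```python
import random

random.seed(1)
mu, sigma = 50.0, 10.0
n = 100_000

# Draw a large sample from a normal distribution
sample = [random.gauss(mu, sigma) for _ in range(n)]

# Proportion of observations within 1 and 2 standard deviations of the mean
within_1sd = sum(abs(x - mu) <= sigma for x in sample) / n
within_2sd = sum(abs(x - mu) <= 2 * sigma for x in sample) / n

print(round(within_1sd, 2), round(within_2sd, 2))  # close to 0.68 and 0.95
```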


Figure 6 The properties of the normal distribution.




1. Non-parametric statistical analysis – generally the results of such tests contain a greater degree of uncertainty than their parametric equivalents. This is due to the lack of assumptions regarding the distribution; such tests are based on ranks and medians rather than continuous data and means.
2. Parametric statistical analysis – if we know that the population from which the sample is drawn is normally distributed we can use parametric tests on the sample data even though they may not be normally distributed. However, often we may not be certain or able to prove that the population data are normally distributed.
3. Transformation into normalized data – linear transformations such as multiplication or subtraction may be insufficient to normalize a dataset, necessitating non-linear methods, for example logarithms, with the drawback of making the results harder to interpret.
Other types of distribution exist, which act as the basis for statistical tests for non-normal data.6 The binomial distribution is based on the relative frequencies of all possible permutations and combinations of discrete data. The Poisson distribution is usually applied to discrete quantitative data, such as counts or incidences occurring over a period of time, for example the number of patients undergoing hip fracture surgery per day in a particular hospital. The 't' distribution is a theoretical distribution derived from the normal distribution with an additional parameter of degrees of freedom, which determines how long the tails of the distribution are. An increase in the sample size increases the degrees of freedom, thereby shortening the tails and bringing the distribution closer to normal. Statistical tests for small samples have been derived from this distribution.
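The effect of a logarithmic transformation on positively skewed data can be sketched with the stdlib; the data are hypothetical and the skewness function uses the moment-based definition:

```python
import math
import statistics

def skewness(xs):
    """Moment-based skewness: mean cubed deviation / (population SD)^3."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)

raw = [1, 2, 2, 3, 4, 5, 8, 15, 40, 120]  # hypothetical, strongly positively skewed
logged = [math.log(x) for x in raw]       # non-linear (logarithmic) transformation

print(round(skewness(raw), 2), round(skewness(logged), 2))  # the logged data are far less skewed
```

The cost, as noted above, is that any subsequent results are on the log scale and must be back-transformed for interpretation.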

Statistical tests for parametric and non-parametric data

So far we have looked at descriptive statistics, but in order to generate conclusions through inferential statistics we need to be able to test one or more hypotheses. A hypothesis is a statement of fact generated by a research question. For example, we may be interested in the degree of association between two continuous variables, the difference between outcomes of two groups or the level of agreement between two observers for one variable. The choice of appropriate test does not just rely on the research question we are addressing but also the previously mentioned data properties with respect to scale of measurement and distribution.

Comparing two independent groups

Many studies involve comparing two groups – either different interventions or exposed versus non-exposed. The type of data and the distribution determine the appropriate statistical test (Figure 7).
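The categorical branch of this choice can be sketched by hand: the chi squared statistic is the sum of (observed − expected)²/expected over the cells of a contingency table. The 2×2 table below is hypothetical:

```python
# Hypothetical 2x2 contingency table: rows = treatments A and B,
# columns = success and failure counts (all cells >= 5, so chi squared applies)
table = [[30, 10],
         [18, 22]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # Expected count under the null hypothesis of no association
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

print(chi2)  # 7.5 — compare with the chi squared distribution, 1 df (5% critical value 3.84)
```

In practice a statistics package would also return the p value; with any expected cell count below 5, Fisher's exact test is used instead.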

Comparing paired data

Occasionally two groups are wrongly compared using the above analysis. This is the case when we are looking at paired data, such as repeated measures before and after an intervention. If we have continuous data, we are interested in the mean of the differences between successive readings rather than the difference in the means between the two groups; effectively reducing the data to a one sample problem. We then need to ascertain whether the distribution of the differences is normal, rather than considering normality of the original two samples, when considering whether to undertake the parametric test or its non-parametric equivalent (Figure 8).
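The one sample reduction can be sketched directly: the paired t statistic is the mean difference divided by its standard error (hypothetical data, stdlib only):

```python
import math
import statistics

# Hypothetical scores before and after an intervention in the same patients
before = [12, 15, 11, 18, 14, 16, 13, 17]
after = [14, 17, 12, 21, 15, 19, 13, 20]

diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_diff = statistics.fmean(diffs)
sd_diff = statistics.stdev(diffs)           # sample SD of the differences

t = mean_diff / (sd_diff / math.sqrt(n))    # paired t statistic with n - 1 df
print(round(t, 2))  # 4.71 — compare against t tables with 7 df
```

If the differences are clearly non-normal, the Wilcoxon signed rank test on the same differences is the non-parametric alternative.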

Comparing more than two groups

When we are comparing more than two groups we have two options (Figure 9). Either we can perform multiple tests for independent groups or, preferably, we can use a one way analysis of variance (ANOVA). This is a parametric test which simultaneously compares all groups' means on the basis that all of the groups are normally distributed with the same variance. If our data are non-parametric then we need to use the Kruskal–Wallis test instead.
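The one way ANOVA F ratio can be computed from first principles as a sketch on hypothetical groups: it partitions the total variation into between-groups and within-groups components:

```python
import statistics

# Three hypothetical groups of continuous, roughly normal scores
groups = [[23, 25, 21, 24, 22],
          [28, 30, 27, 29, 31],
          [24, 26, 25, 23, 27]]

k = len(groups)                              # number of groups
n = sum(len(g) for g in groups)              # total observations
grand_mean = statistics.fmean(x for g in groups for x in g)

# Between-groups and within-groups sums of squares
ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

# F ratio: variation explained by group membership over residual variation
f = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f, 2))  # 18.67 — compare with the F distribution with (k-1, n-k) df
```

A large F indicates that the group means differ by more than within-group scatter would explain; a package would convert this to a p value.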

Correlating the results of two groups

There are many instances when we are trying to show an association between two variables. If we are trying to show that two variables correlate then we should look at using a correlation test (Figure 10), which can be useful in determining causality, concurrent validity and internal consistency. However, one must always be aware of spurious correlations with the passage of time, such as the price of butter increasing with the birth rate due to temporal trends rather than any meaningful relationship between the two. A further point to note is that correlation and association do not equal causality. To establish that A causes B, we have to prove that A always precedes B, that A and B correlate and that if A is absent B cannot occur. There are two tests for correlation depending on whether our numerical data are parametric or non-parametric. This yields a value between −1 (negative correlation) and +1 (positive

Comparing two independent groups: for categorical data, the chi squared test (n ≥ 5 in all categories) or Fisher's exact test (n < 5 in any category); for numerical data, the independent t test (parametric) or the Mann–Whitney test (non-parametric).
Figure 7 Statistical tests for comparing two independent groups.




Comparing paired data: the paired t test (parametric) or the Wilcoxon signed rank test (non-parametric).

Figure 8 Statistical tests for paired data.

correlation), with zero equating to no correlation. These results indicate the measure of scatter of the data around a best fit line when the two variables are plotted against each other. However, these tests can only be used if we expect a linear correlation. Regression models deepen our knowledge of association by describing and quantifying the relationship between two variables. These models can also be used for non-linear relationships. However, our analysis will be dictated by which variables we assign as predictor/independent variables and which ones we assign as outcome/dependent variables. The validation of diagnostic tools requires demonstration of intra- and interobserver agreement6 – in essence, establishing that the results of the test are independent of the observer or extrinsic circumstances at the time of measurement. If we are comparing quantitative scales we can calculate the standard deviation of differences or the coefficient of variation. However, if the data are categorical then we can generate a kappa (κ) statistic, which measures the exact number of agreements occurring in excess of those expected by chance. A value of 1 indicates perfect agreement whereas less than 0.2 indicates poor agreement.
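Both correlation coefficients can be sketched from first principles with the stdlib: Pearson's r from means and deviations, and Spearman's rho as Pearson's r computed on the ranks. The BMI and score data below are hypothetical:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two numerical variables."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank(xs):
    """1-based ranks, with tied values given their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(rank(xs), rank(ys))

# Hypothetical data: BMI plotted against an outcome score
bmi = [22, 25, 27, 30, 33, 36]
score = [90, 85, 80, 70, 65, 55]
print(round(pearson_r(bmi, score), 2), round(spearman_rho(bmi, score), 2))  # -0.99 -1.0
```

Here the relationship is perfectly monotonic, so Spearman's rho reaches −1 while Pearson's r, which measures only linear scatter, is slightly above it.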

Hypothesis testing as the basis of statistical tests interpretation

We have learnt how to correctly utilize a statistical test based on the scale of measurement and distribution of data we have in conjunction with the research question we are trying to assess. However, to correctly interpret the results we have to understand the basis for these tests. The statistical tests described above all act to test hypotheses. For example, if we were to compare the results of two groups A and B, the possible hypotheses are:

Hypothesis 1 (H1): group A is better than group B i.e. difference in one direction (alternative hypothesis)

Hypothesis 2 (H2): group B is better than group A i.e. difference in the opposite direction (alternative hypothesis)

Hypothesis 0 (H0): there is no difference between the groups i.e. effect of interest = zero (null hypothesis)

H1 and H2 are examples of alternative hypotheses, which is what studies are usually interested in. However, statistical tests work on the basis of deriving a probability, a 'p' value, of accepting or rejecting the null hypothesis, which is the exact opposite of the alternative hypotheses. The null hypothesis always defines the effect of interest as zero or non-existent. p values are often quoted as the probability that the results obtained were due to chance, which is an incorrect oversimplification. The correct definition is:

"the probability of obtaining the observed difference, or one more extreme, given the null hypothesis is true"7 Simply, the p value is a probability statement about the likelihood of the statistical observation, or one more extreme, given that the effect of interest is zero. If the p value is high we cannot rule out the null hypothesis, whereas if it is small we rule out the null hypothesis and thus favour the alternative hypothesis. The question arises as to what magnitude of p value can be regarded as small. Arbitrarily, the consensus opinion is for a cut-off value of 5%, i.e. a p value of less than 0.05 is deemed statistically significant with respect to rejecting the null hypothesis. However, hypothesis testing, and by extension p values, can produce errors in analysis. For example, if we are comparing two treatments in a clinical trial we may demonstrate a difference between the two groups with a significant p value and thus reject the null hypothesis when it may actually be true (type I error). Conversely, we may accept the null hypothesis that there is no difference between the two when this is actually false (type II error). These scenarios are depicted in Table 1. We can see that a type I error results from errors in measurement causing us to detect a difference when this is not actually present and represents errors in either the experimental technique or the level at which we have set significance. The converse situation, a type II error, occurs when we have failed to demonstrate a difference that exists. This may also relate to experimental flaws but may also be a result of sampling error. We have already mentioned that often we cannot study the entire population of interest and we have to take a representative sample, which if randomly drawn should share all identifiable and unidentifiable variables with the target population. The extent to which it does this is known as the sampling error. We

Comparing more than two groups: either multiple independent t/Mann–Whitney tests with a Bonferroni correction, or analysis of variance – one way ANOVA (parametric) or the Kruskal–Wallis test (non-parametric).
Figure 9 Statistical tests for comparing more than two groups.


2010 Elsevier Ltd. All rights reserved.


[Figure: flowchart. Correlating results of different groups. Two groups with a linear relationship: Pearson’s correlation (parametric) or Spearman’s rho (non-parametric). More than two groups and/or a non-linear relationship: regression models.]

Figure 10 Tests of correlation of two or more groups.

can see that as the sample size increases, the sample becomes more representative and this error decreases. Therefore, our ability to detect a meaningful difference between two groups is highly dependent on sample size.
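The relationship between sample size and sampling error can be illustrated with a short simulation (an illustrative sketch, not taken from the article; the population mean and standard deviation are arbitrary): we repeatedly draw samples of a given size from a known population and measure how widely the sample means scatter around the true mean.

```python
import random
from statistics import mean, stdev

random.seed(1)

def sample_mean_spread(n, reps=2000, mu=50, sigma=10):
    """Draw `reps` samples of size n from a Normal(mu, sigma) population and
    return the standard deviation of the sample means, i.e. the empirical
    sampling error of the mean for samples of that size."""
    means = [mean(random.gauss(mu, sigma) for _ in range(n)) for _ in range(reps)]
    return stdev(means)

# The spread of sample means shrinks roughly as sigma / sqrt(n), so larger
# samples estimate the population mean more precisely.
```

Running this with n = 4 and n = 100 shows the spread collapsing from around sigma/2 to around sigma/10, which is why small studies struggle to detect real differences.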

Power calculations

We can see that larger studies are more powerful with respect to their ability to detect a meaningful difference and thus reject the null hypothesis. The power of a study is defined as:

“the probability that a study of a given size will register as statistically significant a real difference of a given magnitude.” 8

In essence, a power calculation allows us to determine the sample size required to register a specific magnitude of effect. However, to determine this, we first need to establish the smallest true clinical or experimental effect that we consider meaningful: if this magnitude of effect exists, we can state that there is a difference between the two groups. The other important variables are the power (1 − β), i.e. the probability of our study detecting this magnitude of effect, and the level of statistical significance (α). These are typically set at 80% and 5% respectively but can be set at any level. The power calculation is based on these variables and assumes independent groups with roughly equal sample sizes that have normal distributions.

A power calculation is likely to produce a very large sample size if the clinical effect, or difference between the two groups we are measuring, is very small or the variance is very large. Conversely, if we reduce the power or increase the level of significance then our sample size may be smaller, but at the cost of increasing the risks of type II and type I errors respectively.
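For the common case of comparing two independent means, the sample size implied by these choices can be sketched with the standard normal approximation (a simplified illustration only; real studies should use dedicated software, and the effect size and standard deviation below are hypothetical):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a difference in means of
    `delta` with common standard deviation `sd`, two-sided significance
    `alpha` and the stated power, using
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2)

# A small effect relative to the variance inflates the required sample size:
# detecting a 5-point difference with sd = 10 needs 63 patients per group,
# whereas a 1-point difference needs 1570 per group.
```

This makes concrete the trade-offs described above: halving the detectable effect roughly quadruples the required sample size.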

Problems with p values and hypothesis testing

Hypothesis testing is sound in principle, but restricting analysis to the interpretation of p values alone has significant drawbacks. Let us consider some theoretical examples of how this may occur.

A new perioperative regime to optimize the care of lower limb arthroplasty patients is introduced in order to reduce the inpatient stay and associated costs of treatment. A study is carried out looking at the inpatient stay of this population before and after this intervention is introduced. An analysis comparing the before and after groups demonstrates a statistically significant difference, with a p value less than 0.05. The new intervention is heralded as a success and implemented. However, on closer scrutiny it can be seen that the actual difference between the two groups is less than 1 day, which is not clinically significant. The costs of implementing this intervention have actually exceeded the financial benefit in terms of reduced stay. Very small differences can become statistically significant if a large enough sample size is used.

Occasionally multiple tests are carried out on datasets, either purposefully or in the vain hope of demonstrating a statistically significant relationship. As significance relates to probability, the greater the number of tests undertaken, the more likely you are to come up with a significant p value by chance. Although multiple testing should be avoided, either by limiting tests to those specified in advance by the research question or through use of regression models, occasionally it may be unavoidable. In these instances the easiest remedy is the Bonferroni correction, whereby each statistically significant p value is multiplied by the number of tests undertaken to see if it remains significant.
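The Bonferroni correction described above can be sketched in a few lines (an illustrative sketch; the p values are invented):

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p value by the number of tests
    performed (capped at 1), then judge significance against alpha."""
    k = len(p_values)
    adjusted = [min(1.0, p * k) for p in p_values]
    return adjusted, [p < alpha for p in adjusted]

# Three tests: only the smallest raw p value survives the correction.
adjusted, significant = bonferroni_adjust([0.01, 0.04, 0.30])
# adjusted ~ [0.03, 0.12, 0.90]; significant -> [True, False, False]
```

Note how a raw p value of 0.04, nominally significant on its own, is no longer significant once the number of tests is accounted for.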

Type I and type II errors

                                          Result of statistical test
True state of affairs                     No effect/difference detected    Effect/difference detected
Effect of interest/difference             Correct                          Type I error (α)
non-existent
Effect of interest/difference exists      Type II error (β)                Correct

Table 1




Another limitation of hypothesis testing can be exemplified by considering a theoretical study comparing a new thromboprophylactic drug with an old one, which demonstrates a fivefold reduction in the incidence of postoperative thromboembolic events. However, the p value generated is 0.15, which is deemed non-significant. On this basis, no further studies are undertaken and plans to replace the old drug with the new one are indefinitely shelved. This theoretical example highlights the very real risk that we may dismiss effective interventions on the basis of statistical significance alone. A high p value merely suggests that there is insufficient evidence to reject the null hypothesis. Arbitrarily accepting the null hypothesis because significance was set at a particular level implies that we have found no proof of difference. However, “no proof of difference” does not equate to “proof of no difference”. 9 This is a fundamental consideration when interpreting p values. Judging a p value against significance set at a particular level effectively reduces the answer to any research question to a yes/no status. It is more informative to look at the p value itself and its probability implications rather than at arbitrary cut-offs. Unfortunately, this relative ease of understanding has led to publication bias towards significant results. 1 The analysis of data should focus on characterizing the actual study effect under consideration rather than on probability statements alone.

Estimation and confidence intervals

Hypothesis testing asks the question “is it? or isn’t it?”. What we are actually interested in is “how big is the difference?” and “in what direction is the difference?”. As our results encompass uncertainty, we have to rely on methods to estimate where the true value of interest lies. As we have seen earlier, we can make point estimates based on measures of central tendency, such as the arithmetic mean. However, these do not take into account the variability or dispersion of the data relative to this value. Of greater interest is the estimation of an interval that we can be confident encloses the unknown true population value. These are known as confidence intervals and encompass an estimate of the true value (for example the arithmetic mean) as well as the sampling variability, with some level of assurance or confidence. A confidence interval can be constructed at any level, but by convention 95% confidence intervals are usually derived. It is important to note that confidence intervals are not direct probability statements. To state that a 95% confidence interval means that there is a 95% probability that the true population value of interest lies within this range is false. What it actually means is

that if we took 100 similar sized samples from the population and derived 95% confidence intervals for these samples, then 95 of these intervals would contain the true population value, i.e. 95% of similarly constructed confidence intervals will contain the true value. 6 The equation for deriving a 95% confidence interval is given by the formula: 7

95% CI = x̄ ± 1.96(s/√n)

where x̄ = sample mean, s = standard deviation, n = sample size.

We can see from this formula that if our standard deviation (and thus the variance) is large and/or our sample size is small, then our confidence interval, and thus our magnitude of uncertainty, is large. The width of confidence intervals decreases with increasing sample size, but it is always advantageous to compare intervals, no matter how wide, rather than point estimates.

An example of comparing confidence intervals to means is as follows: consider a theoretical study where the mean risk of developing a complication with Operation A compared to Operation B is fourfold, with a confidence interval of 0.6 to 9.7. The means alone suggest that the risk is higher with Operation A. However, looking at the confidence interval we can see that it encloses 1, i.e. equivalence. Therefore, if this is the true difference, then we can state that there is equivalent risk with both operations. There is also the possibility that there is less risk with Operation A, because the interval encloses 0.6. This finding is tempered by the fact that the confidence interval extends in the opposite direction to a greater than ninefold risk with Operation A. The reader is thus endowed with far greater knowledge regarding the difference between the two groups than that provided by a p value alone. For this reason, statistical analysis should always look at confidence intervals so as to show the direction and magnitude of effect. Only then can we determine whether a statistically significant p value actually has clinical significance. However, we must always remember that confidence intervals, like p values, are not immune to errors in study design or bias, and that the intervals should always be regarded as the smallest estimate of the real error in defining the true population value.
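The confidence interval formula above can be computed directly (a minimal sketch using the normal approximation; the sample data are invented):

```python
from math import sqrt
from statistics import mean, stdev

def ci95(sample):
    """95% confidence interval for a mean: x_bar +/- 1.96 * s / sqrt(n)."""
    n = len(sample)
    se = stdev(sample) / sqrt(n)  # standard error of the mean
    return mean(sample) - 1.96 * se, mean(sample) + 1.96 * se

low, high = ci95([10, 12, 11, 13, 9, 11, 10, 12])
# The interval narrows as n grows, because the standard error shrinks with sqrt(n).
```

For small samples a t-distribution multiplier rather than 1.96 would be more accurate; the normal approximation is used here to mirror the formula in the text.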

Diagnostic tests

When we are looking at diagnostic tests for conditions, we need to be aware of certain statistical definitions. Usually we are interested in comparing the results of a new diagnostic test with an established reference test or standard for diagnosing the condition. If we plot a two by two table for all possible


Different outcomes for diagnostic test compared to reference standard

                      Disease/reference standard
Test result           Present                   Absent                    Total
Positive              a (true positives)        b (false positives)       a + b
Negative              c (false negatives)       d (true negatives)        c + d
Total                 a + c                     b + d

Table 2





Common mistakes in statistical analysis and interpretation

Planning/study design
  • Absence of research question/hypotheses before data collection
  • No sample size/power calculation
  • No criteria specified for sample, i.e. inclusion/exclusion criteria
  • Too many variables, primary outcome of interest not stated

Data analysis
  • Compressing data/changing continuous data to ordinal data
  • Using inappropriate measures of central tendency, e.g. means for skewed data
  • Not assessing normality of frequency distribution
  • Incorrectly applying parametric tests when assumptions not satisfied
  • Paired data analyzed as independent groups
  • Inappropriate methods of assessing agreement
  • Inappropriate multiple testing with no correction

Results interpretation
  • Significance rather than actual p values quoted
  • Means compared but no estimate of variability, i.e. confidence intervals
  • Statistical significance favoured over clinical significance

Table 3

permutations of results of these two tests, we can define several measures of the efficacy of our diagnostic test, as shown in Table 2. 6

Sensitivity = proportion of patients who have the condition that are correctly identified by the test = a/(a + c). A highly sensitive test can be used to rule a condition out.

Specificity = proportion of patients without the condition that are correctly identified by the test = d/(b + d).

Positive predictive value = proportion of patients with a positive test result who are correctly diagnosed = a/(a + b).

Negative predictive value = proportion of patients with a negative test result who are correctly diagnosed = d/(c + d).
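Using the cell labels of Table 2, these four measures can be computed together (a minimal sketch; the counts below are invented):

```python
def diagnostic_stats(a, b, c, d):
    """Descriptive statistics from a 2x2 diagnostic table, with
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),  # positive predictive value
        "npv": d / (c + d),  # negative predictive value
    }

stats = diagnostic_stats(a=80, b=10, c=20, d=90)
# sensitivity = 0.8, specificity = 0.9
```

Unlike sensitivity and specificity, the predictive values depend on how common the condition is in the tested population, so they should not be carried over to populations with a different prevalence.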


In summary, we can see that statistical tests can only be meaningfully and appropriately applied if we understand the properties of our dataset and base our analysis on a suitable research question. However, the statistical tests themselves are only half the battle; the remainder is how to interpret the results to generate credible findings. Table 3 summarizes the common errors in both analysis and interpretation to conclude this discourse and serve as a reminder of the main points to consider before embarking on statistical analysis.


References

1 Hopewell S, Loudon K, Clarke MJ, Oxman AD, Dickersin K. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database Syst Rev 2009; Issue 1.
2 Qureshi AA, Ibrahim T. Study design in clinical orthopaedic trials. Orthop and Trauma 2010; 24: 229-40.
3 Ballentine LE. The statistical interpretation of quantum mechanics. Rev Mod Phys 1970; 42: 358-81.
4 Kirkwood BR, Sterne JAC. Essential medical statistics. 2nd edn. Blackwell, 2003.
5 Altman DG, Bland JM. The normal distribution. BMJ 1995; 310: 298.
6 Bland M. An introduction to medical statistics. 3rd edn. Oxford University Press, 2000.
7 Campbell MJ, Machin D. Medical statistics: a commonsense approach. 3rd edn. Wiley, 1999.
8 Altman DG. Statistics and ethics in medical research. III How large a sample? BMJ 1980; 281: 1336-8.
9 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.

