DECISION MAKING
STATISTICAL ANALYSIS
Fallacies of Statistical Significance
The case for focusing your data analyses beyond significance tests
by Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker
[...] data analysis, will address the important distinction between statistical significance and practical significance, and urge practitioners to carry their analyses beyond significance tests, often by constructing appropriate confidence intervals.

Statistical significance and p-values in reliability data analysis
[...] (MPa) and a shape parameter (standard deviation of log tensile strength) of 0.25. This distribution is shown by the red dashed curves in Figures 1 (p. 57) and 2.

FIGURE 2: Alternative assumed lognormal [...] for specimens [...]

It is desired to replace the old furnace with a new furnace. Before making this important process change, however, the manufacturer wants to compare the tensile strength distributions of the two furnaces and, in particular, the medians of the two distributions.

With this in mind, a random sample of specimens built using the new furnace are to be tested to compare, among other things, the median of the estimated distribution of tensile strength for the new furnace with the assumed known median tensile strength of 35 MPa for the old furnace.

You can argue that in this example, like many other similar applications, there is no reason for product from two different furnaces to have identical tensile strength distributions or, for that matter, identical distribution medians. It was felt, however, that the difference between furnaces might not be sufficiently large to be considered practically important.

We will now describe two scenarios, resulting in:
++ Statistical significance when, as a consequence of a large sample size, the effect under consideration is, in fact, without practical significance.
++ Failure to achieve statistical significance, as a consequence of a small sample size, when, in fact, the effect is of practical significance.

Scenario one: Statistical significance without practical significance
Assumed tensile strength distribution for new furnace. Under this scenario, assume that the tensile strength distribution for specimens from the new furnace, like that for the old furnace, is lognormal with a shape parameter of 0.25. However, now assume that the median of this distribution has shifted (slightly down) from 35 MPa to 34 MPa. This distribution is shown by the blue solid curve in Figure 1. The small deterioration in tensile strength for the new furnace, as compared with the old furnace, can be readily compensated for by taking other measures and, if it were known, would not be regarded as sufficiently large to disqualify the new furnace.

Assumed tensile strength data. The preceding information about the new furnace, unlike that for the old furnace, is unknown to the manufacturer. All that is available is tensile strength data from a sample of 1,000 specimens, randomly selected from the new furnace. [...] simulation from the assumed tensile strength distribution for the new furnace, shown in Figure 1.

Statistical significance analysis. We conducted a [...] hypothesis. In this example, this rejection would result in the conclusion that there is a statistically significant difference [...]
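The scenario one comparison can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis: it assumes the test is carried out on the log scale with a large-sample normal approximation, and the function names and the seed are hypothetical choices made for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev

OLD_MEDIAN = 35.0   # MPa, known median for the old furnace (from the article)
NEW_MEDIAN = 34.0   # MPa, assumed true median for the new furnace (scenario one)
SHAPE = 0.25        # standard deviation of log tensile strength


def simulate_strengths(n, median, shape, seed):
    """Draw n lognormal tensile strengths with the given median and shape."""
    rng = random.Random(seed)
    # For a lognormal distribution, the median equals exp(mu).
    return [rng.lognormvariate(math.log(median), shape) for _ in range(n)]


def median_test(sample, null_median):
    """Two-sided large-sample test of H0: median == null_median (log scale).

    Assumption: a normal approximation stands in for whatever exact test
    the article used; with 1,000 observations the difference is small.
    """
    logs = [math.log(x) for x in sample]
    std_err = stdev(logs) / math.sqrt(len(logs))
    z = (mean(logs) - math.log(null_median)) / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return math.exp(mean(logs)), p_value


sample = simulate_strengths(1000, NEW_MEDIAN, SHAPE, seed=1)  # arbitrary seed
est_median, p_value = median_test(sample, OLD_MEDIAN)
print(f"estimated median = {est_median:.2f} MPa, p-value = {p_value:.4f}")
```

With a sample this large, the estimated median lands close to 34 MPa, so even a 1 MPa shift will usually produce a small p-value, although any single simulated sample can differ.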
[...] sample sizes for the new furnace and different significance levels, assuming the tensile strength distributions for the two furnaces, shown in Figure 1.

In particular, despite the small true differences between the medians of the tensile strength distributions for the two furnaces shown in Figure 1, evidence at a 5% significance level of a statistically significant difference in medians seems likely for sample sizes of 500 specimens or more for the new furnace and seems almost assured for samples of 1,500 or more.

On the other hand, for much smaller sample sizes from the new furnace, such as 100 specimens, it is unlikely that the analysis of the data will lead to a statistically significant difference.

Scenario two: Failure to achieve statistical significance when there is practical significance
Assumed tensile strength distribution for the new furnace. Now consider the scenario in which the difference in tensile strength distributions for specimens from the two furnaces is regarded to be large enough to be of practical significance. In particular, let's again assume that the tensile strength distribution for specimens from the new furnace, like that from the old furnace, is lognormal with a shape parameter of 0.25.

Now assume that the median of the distribution has shifted from 35 MPa to 31 MPa. This distribution is shown and contrasted with the known distribution for the old furnace by the blue solid curve in Figure 2. This 4 MPa difference in the medians between the two furnaces is of practical significance and, if known, would, in fact, be sufficient reason to reject the new furnace.

Assumed tensile strength data. The preceding information about the new furnace, unlike that for the old furnace, is again unknown to the manufacturer. All that is available under this scenario is tensile strength data from a scant sample of 10 (simulated) specimens, randomly selected from this furnace. This small sample is in strong contrast to the large sample (1,000 specimens) for scenario one. For purposes of illustration, we created such a sample by randomly selecting 10 observations via computer simulation from the assumed tensile strength distribution for the new furnace shown in Figure 2.

Statistical significance analysis. We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 10 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace. In particular, the null hypothesis for the significance test again stated that the median tensile strength for the new furnace is 35 MPa. This analysis resulted in insufficient evidence of a statistically significant difference between the two distribution medians at the 5% significance level (that is, failure to reject the null hypothesis) and a p-value of 0.21.

The preceding results might suggest to some that it makes no difference, with regard to tensile strength, which furnace is used. This, however, would be an incorrect conclusion in light of the underlying (but unknown to the manufacturer) true distribution for the new furnace, shown in Figure 2, and the actual difference in distribution medians between the two furnaces.

Is the preceding result typical? Further analysis shows that, given the underlying tensile strength distribution for the new furnace shown in Figure 2, the probability of getting a p-value of 0.05 or more (and failing to declare statistical significance), in comparing the estimated median from a random sample of 10 specimens from the new furnace with the known median of 35 MPa for the old furnace, is 0.72 (obtained from Table 2). In addition, we found that in randomly selecting 10 specimens from the new furnace and comparing the median of this sample with the known median of 35 MPa for the old furnace, the probability of getting a p-value of 0.21 or greater is 0.42. Thus, the results again are typical of what you might expect in such simulations.

More on impact of sample size. Table 2 shows the probability of failing to establish a statistically significant difference between the medians of the tensile strength distributions for the two furnaces for different sample sizes for the new furnace and different significance levels, assuming the tensile strength distributions for the new furnace, shown in Figure 2.

In particular, despite the relatively large (in terms of practical significance) true differences between the tensile strength distributions for the two furnaces shown in Figure 2, failure to establish a statistically significant difference at [...] 10 or less. On the other hand, for larger sample sizes, such as 75 or more, the analysis of the data will likely lead to a statistically significant difference.
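The dependence on sample size in both scenarios can be approximated with a short calculation. The sketch below is an assumption-laden illustration rather than the method behind Table 2: it treats the shape parameter of 0.25 as known and uses a normal approximation on the log scale, and the function name is hypothetical. Under that simplification the value for 10 specimens at the 5% level comes out at about 0.66 rather than the 0.72 quoted from Table 2, presumably because the article's test also has to estimate the spread from the data; the two should agree more closely as the sample grows.

```python
import math
from statistics import NormalDist

OLD_MEDIAN = 35.0  # MPa, null-hypothesis median (old furnace)
SHAPE = 0.25       # sd of log tensile strength, treated as known (assumption)


def prob_not_significant(n, true_median, alpha=0.05):
    """Approximate probability that a two-sided test fails to reject H0.

    H0 states that the new-furnace median equals OLD_MEDIAN; true_median
    is the assumed actual median of the new furnace.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Standardized shift of the mean log strength under the assumed truth.
    shift = abs(math.log(true_median / OLD_MEDIAN)) / SHAPE * math.sqrt(n)
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)


# Scenario two: true median 31 MPa, sample sizes as in Table 2.
for n in (10, 20, 50, 75, 100):
    print("scenario two", n, round(prob_not_significant(n, 31.0), 4))

# Scenario one: true median 34 MPa, where only large samples reject.
for n in (100, 500, 1000, 1500):
    print("scenario one", n, round(prob_not_significant(n, 34.0), 4))
```

The same function covers both scenarios because only the assumed true median changes: a 4 MPa shift is detected with modest samples, while a 1 MPa shift needs samples in the hundreds or thousands.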
TABLE 2

Number of specimens sampled     Significance level
from the new furnace            10%       5%        1%
10                              0.5893    0.7211    0.9039
20                              0.3271    0.4599    0.7228
50                              0.0410    0.0801    0.2297
75                              0.0059    0.0143    0.0626
100                             0.0007    0.0022    0.0140

This table shows the probability of failing to establish a statistically significant difference in median tensile strength between the old and new tensile strength distributions shown in Figure 2 for different sample sizes from the new furnace.

A confidence interval is generally more informative
The preceding discussion has hopefully convinced you to focus your data analyses beyond significance tests. But what do we recommend?

First, use an incisive plot of the data (for example, sample data displayed on a lognormal probability plot [1]). Then, to assess the practical significance of the results, construct an appropriate statistical interval. A variety of statistical intervals exist to address specific applications. [2] The most frequently used among these is a confidence interval. In the current application, for example, it would be appropriate to compute a confidence interval for the median tensile strength for the new furnace and compare it with the known median tensile strength of the old furnace.

As noted later, there is a close relationship between confidence intervals and significance tests. Confidence intervals, however, generally give much more information than do significance tests. This is because confidence intervals provide quantitative bounds on the statistical uncertainty, providing direct information about practical significance rather than just an accept-or-reject hypothesis decision.

Such intervals also shrink in length, as they should, with an increase in sample size and the associated reduction in uncertainty. In addition, confidence intervals generally are easier to explain to management and customers than hypothesis tests. To illustrate this, let's return to the earlier scenarios.

Scenario one. A 95% confidence interval on the median of the tensile strength distribution for specimens from the new furnace is calculated from the 1,000 (simulated) specimens to be (33.6, 34.7) MPa. Roughly speaking, this means that you can be 95% sure that the median tensile strength for the new furnace is between 33.6 and 34.7 MPa. More precisely, you can assert that if there were many such intervals calculated from different sets of data, about 95% of such intervals would, in fact, contain the true median.

The deviation of the median tensile strength for the new furnace from 35 MPa is statistically significant at the 5% significance level; that is, the null hypothesis is rejected. This is because the 95% confidence interval for the new furnace median does not include the null hypothesis value of 35 MPa.

Moreover, the deviation of the median tensile strength from 35 MPa could be as small as 0.3 MPa (35 - 34.7) and is unlikely to exceed 1.4 MPa (35 - 33.6). Even the latter deviation, however, would not be regarded as being of practical significance, despite its statistical significance.

Scenario two. A 95% confidence interval on the median of the tensile strength distribution for specimens from the new furnace is calculated from the 10 (simulated) specimens to be (26.6, 37.5) MPa. Because this interval includes 35 MPa, the data do not provide evidence at the 5% significance level to reject the null hypothesis that the median of the tensile strength distribution for the new furnace is 35 MPa.

Moreover, the limited data for the new furnace suggests that it is possible that the median of the tensile strength distribution for the new furnace is as much as 8.4 MPa (35 - 26.6) smaller than that for the old furnace. But it also appears possible that the median tensile strength for the new furnace is 2.5 MPa (37.5 - 35) greater than that of the old furnace. Both conclusions would be regarded as being of practical significance. Thus, further analysis of the new furnace data suggests that there could be a difference of practical importance in favor of either the new or the old furnace.
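For readers who want to try this on their own data, here is one way such intervals might be computed. The article does not spell out its interval construction (the details are on the article's webpage), so this sketch assumes a standard interval for the mean of the log strengths, transformed back to MPa. The function names, the seed and the simulated samples are illustrative only, and the simulated intervals will not reproduce the article's (33.6, 34.7) and (26.6, 37.5) exactly.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def lognormal_median_ci(sample, crit):
    """Confidence interval for a lognormal median, built on the log scale.

    crit is the critical value (normal or Student t quantile) chosen by
    the caller for the desired confidence level and sample size.
    """
    logs = [math.log(x) for x in sample]
    half_width = crit * stdev(logs) / math.sqrt(len(logs))
    center = mean(logs)
    return math.exp(center - half_width), math.exp(center + half_width)


def rejects_null(interval, null_median=35.0):
    """The matching two-sided test rejects H0 when the interval excludes null_median."""
    low, high = interval
    return not (low <= null_median <= high)


rng = random.Random(7)  # arbitrary seed for a repeatable illustration
large = [rng.lognormvariate(math.log(34.0), 0.25) for _ in range(1000)]
small = [rng.lognormvariate(math.log(31.0), 0.25) for _ in range(10)]

Z_975 = NormalDist().inv_cdf(0.975)  # about 1.96, adequate for n = 1,000
T_975_DF9 = 2.262                    # Student t 0.975 quantile, 9 degrees of freedom

ci_large = lognormal_median_ci(large, Z_975)
ci_small = lognormal_median_ci(small, T_975_DF9)
for label, ci in (("n=1000", ci_large), ("n=10", ci_small)):
    print(label, tuple(round(v, 1) for v in ci), "reject 35 MPa:", rejects_null(ci))
```

Applied to the article's reported intervals, `rejects_null((33.6, 34.7))` is true and `rejects_null((26.6, 37.5))` is false, which mirrors the two significance-test outcomes while also showing how wide the remaining uncertainty is.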
What this really means is that the specimen-to-specimen variability is too large, and/or the sample size from the new furnace is too small to permit you to draw definitive conclusions, and additional samples from the new furnace are needed to obtain more conclusive results.

TECHNICAL NOTES
++ Details of the statistical analyses for the two scenarios, including the construction of the statistical significance tests and the calculation of the confidence intervals, as well as the data files for the 1,000 and 10 simulated observations for the two scenarios, can be found on this article's webpage at www.qualityprogress.com.
++ The calculation of significance tests and statistical intervals assume that the given data can be considered to be a random sample from the population(s) of interest. They, therefore, quantify only the uncertainty due to sampling variability. In our examples, the data are from past production, but, in practice, you're typically interested in performance for future production, which makes this, borrowing from W. Edwards Deming's terminology, an analytic study. See W. Edwards Deming, "On Probability as a Basis for Action," The American Statistician, Vol. 29, No. 4, 1975, pp. 146-152. Thus, a basic assumption underlying any future projections of our analyses is that the statistical distributions of tensile strengths of specimens from the two furnaces do not change from the past to the future. When this assumption is questionable, it might be appropriate to refrain from doing any statistical analysis or, as a minimum, to make clear the limitations of such analyses.
++ The two scenarios in this column are focused on the comparison of the medians of the tensile strength distributions for the two furnaces because this was of greatest interest in this application. The analyses can be readily modified, however, to apply for other population properties of interest, such as different tensile strength distribution percentiles or the probability of tensile strength falling below a specified value.
++ The two scenarios involved complete samples in that a quantitative tensile strength reading was obtained for all specimens. In many reliability applications, the data need to be analyzed before all units have failed, resulting in so-called censored observations. The general results presented in this column also extend to such situations.
++ The controversy surrounding significance testing, and the related concept of p-values, has been in the spotlight recently and is the subject of a statement by the American Statistical Association. See Ronald L. Wasserstein and Nicole A. Lazar, "The ASA's Statement on p-values: Context, Process, and Purpose," The American Statistician, Vol. 70, No. 2, 2016, pp. 129-133.

REFERENCES
1. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Reliability (or Product Life) Data Analysis: A Case Study," Quality Progress, June 2000, pp. 115-121.
2. William Q. Meeker, Gerald J. Hahn and Luis A. Escobar, Statistical Intervals: A Guide for Practitioners and Researchers, second edition, Wiley, 2017.