However, this paper endeavors to provide an investigation into discretionary adjustments of in-tolerance instruments which are made, not to mitigate false accept risk, but as a preemptive
measure in an attempt to reduce the potential for future out-of-tolerance (OOT) conditions. A
reduction in OOT probability can translate into improved EOPR reliability. Such adjustments are
often made on the bench at the discretion of the calibration technician when the observed error
is deemed "too close" to the tolerance limits. Organizations may have a blanket policy
or threshold in place that defines, in a broad sense, what "too close" means. This adjustment
threshold may be 70 % of specification, 80 % of specification, or any other arbitrary value. The
intent of this policy may be to improve accuracy and mitigate future OOT conditions, improving
EOPR. The objective of this paper is to investigate whether such adjustments can, in fact,
provide an increase in accuracy and a reduction in OOT probability (increased EOPR) and, if so,
by how much and under what conditions. The possibility of calibration adjustments unwittingly
degrading performance is also investigated.
There are no national or international standards which dictate or require adjustment during
calibration, unless an instrument is found OOT or the observed error fails to meet guardband
criteria. ANSI/NCSL Z540.3-2006 and ISO/IEC 17025:2005 do not mandate discretionary
adjustment of in-tolerance items [1-3]. The International Vocabulary of Metrology (VIM)
clearly defines calibration, verification, and adjustment as separate actions [4]. Adjustment is not
a de facto aspect of calibration. As defined by the VIM:
Calibration: Operation that, under specified conditions, in a first step, establishes a relation
between the quantity values with measurement uncertainties provided by measurement standards
and corresponding indications with associated measurement uncertainties and, in a second step,
uses this information to establish a relation for obtaining a measurement result from an
indication. NOTE 2: Calibration should not be confused with adjustment of a measuring
system, often mistakenly called "self-calibration," nor with verification of calibration.
Adjustment of a measuring system: Set of operations carried out on a measuring system so that
it provides prescribed indications corresponding to given values of a quantity to be measured.
NOTE 2: Adjustment of a measuring system should not be confused with calibration, which is a
prerequisite for adjustment.
Verification: Provision of objective evidence that a given item fulfils specified requirements.
EXAMPLE 2: Confirmation that performance properties or legal requirements of a measuring
system are achieved. NOTE 3: The specified requirements may be, e.g., that a manufacturer's
specifications are met. NOTE 5: Verification should not be confused with calibration.
Despite these established definitions, there have been recent accounts where entities regulated by
the Food and Drug Administration (FDA) have received Form-483 Investigational Observations
and Warning Letters arising from the failure to always adjust in-tolerance instruments (i.e. all
instruments) during calibration [5]. These incidents may be attributable to a nebulous distinction
between the definitions of calibration, verification, and adjustment. References to similar events
in regulated industries have also been published [6-8] where calibration requirements have
been inferred to mandate adjustment during calibration.
Consistent with the VIM definitions, a calibration, where a pass/fail conformance decision is
made, also satisfies the definition of a verification. However, the converse is not true; not all
verifications are calibrations. This distinction is important because, for example, not all
calibrations result in a pass/fail conformance decision being issued. Such is the case for most
calibrations performed by National Metrology Institutes (NMI) and some reference standards
laboratories where calibrations are routinely performed and no pass/fail conformance decision is
made. The definition of calibration requires no such conformance decision be rendered. In these
cases, calibration consists of the measurement data reported along with the measurement
uncertainty. Such operations still adhere to the VIM definition of calibration, but they are not
verifications, since no statement of conformance to metrological specifications is given.
However, calibrations that do result in a statement of conformance (i.e. pass/fail) with respect
to an established metrological specification, are also verifications. In such scenarios, the
definitions of calibration and verification are both applicable. However, the absence of
adjustment of a measuring system during calibration in no way negates or disqualifies the
proper usage of the term calibration. Many instruments do not lend themselves to adjustment
and are not designed to be physically or electronically adjusted to periodically nominalize their
performance for the purpose of reducing measurement errors; yet, such instruments are still quite
capable of being calibrated. The distinction is readily apparent as indicated by ANSI/NCSL
Z540.3-2006 sections 5.3a and 5.3b, shown below [1, 2].
5.3 Calibration of Measuring and Test Equipment
a) Where calibrations provide for reporting measured values, the measurement uncertainty
shall be acceptable to the customer and shall be documented.
b) Where calibrations provide for verification that measurement quantities are within specified
tolerances, the probability that incorrect acceptance decisions (false accept) will result from
calibration tests shall not exceed 2 % and shall be documented. Where it is not practicable
to estimate this probability, the test uncertainty ratio shall be equal to or greater than 4:1.
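The 2 % false accept limit of 5.3b can be illustrated with a simple Monte Carlo sketch. The distributions, the 95 % containment assumption for the UUT, and the function name below are illustrative assumptions, not part of the standard:

```python
import random

random.seed(11)

def false_accept_probability(tur: float, eopr_sigma: float = 0.51,
                             tol: float = 1.0, trials: int = 500_000) -> float:
    """Monte Carlo estimate of unconditional false-accept probability.

    Illustrative assumptions only: the UUT error is N(0, eopr_sigma) so that
    ~95 % of errors fall within the tolerance, and the calibration process
    error is normal with a 95 % expanded uncertainty of tol/tur.
    Accept when |measured| <= tol.
    """
    sigma_std = (tol / tur) / 1.96
    false_accepts = 0
    for _ in range(trials):
        true_error = random.gauss(0.0, eopr_sigma)
        measured = true_error + random.gauss(0.0, sigma_std)
        # False accept: the item passes the test but is truly out of tolerance.
        if abs(measured) <= tol and abs(true_error) > tol:
            false_accepts += 1
    return false_accepts / trials

print(round(100 * false_accept_probability(4.0), 2))  # well under 2 % at 4:1 TUR
```

Under these assumptions, the estimate at a 4:1 test uncertainty ratio comes out to roughly 1 % or less, consistent with the intent of the 4:1 fallback in 5.3b.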
2. NCSLI RP-1: Establishment and Adjustment of Calibration Intervals
As stated, discretionary adjustments of in-tolerance instruments are often left to the judgment of
the calibration technician, or governed by organizational policy. When deferred to the discretion
of the technician, such adjustments are optimally based on professional evaluation by qualified
personnel with experience and training in the metrological disciplines for which they are
responsible. Heuristic assessment of instrument adjustment requirements, combined with
empirical data and epistemological knowledge gathered over multiple calibration operations may
provide a somewhat intuitive qualitative notion of when adjustment might be beneficial.
However, there is little formal quantitative guidance on this subject. The most authoritative
reference on such discretionary adjustments is found in NCSLI Recommended Practice RP-1,
Establishment and Adjustment of Calibration Intervals, henceforth referred to as NCSLI RP-1
[9]. Appendix G of NCSLI RP-1 refers to three adjustment policies as
1) Renew-always
2) Renew-if-failed
3) Renew-as-needed
2015 NCSL International Workshop & Symposium
NCSLI RP-1 employs the term "renew" to convey an adjustment action. Herein, the renew-as-needed policy is synonymous with discretionary adjustment. As stated in RP-1 [9]:
"At present, no inexpensive systematic tools exist for deciding on the optimal renewal policy for
a given MTE. While it can be argued that one policy over another should be implemented on an
organizational level, there is a paucity of rigorously demonstrable tests that lead to a clear-cut
decision as to what that policy should be. The implementation of reliability models, such as the
drift model, that yield information on the relative contributions of random and systematic effects,
seems to be a step in the right direction."
The objective of this paper is to provide some additional discourse regarding the random and
systematic drift effects associated with some instruments and to provide insight as to the impact
of these effects on EOPR reliability under various discretionary adjustment thresholds. As
provided in NCSL RP-1 [9], discretionary adjustments may be influenced by one or more of the
following criteria, where this paper focuses specifically on questions #4, #5, #6, & #7:
1) Does parameter adjustment disturb the equilibrium of a parameter, thereby hastening the
occurrence of an out-of-tolerance condition?
2) Do parameter adjustments stress functioning components, thereby shortening the life of
the MTE?
3) During calibration, the mechanism is established to optimize or "center-spec"
parameters. The technician is there, the equipment is set up, the references are in-place.
If it is desired to have parameters performing at their nominal values, is this not the best
time to adjust?
4) By placing parameter values as far from the tolerance limits as possible, does adjustment
to nominal extend the time required for re-calibration?
5) Do random effects dominate parameter value changes to the extent that adjustment is
merely a futile attempt to control random fluctuations?
6) Do systematic effects dominate parameter value changes to the extent that adjustment is
beneficial?
7) Is parameter drift information available that would lead us to believe that not adjusting
to nominal would, in certain instances, actually extend the time required for recalibration?
8) Is parameter adjustment prohibitively expensive?
9) If adjustment to nominal is not done at every calibration, are equipment users being
short-changed?
10) What renewal practice is likely to be followed by calibrating personnel, irrespective of
policy?
11) Which renewal policy is most consistent with a cost-effective interval analysis
methodology?
Weiss [10] addressed the issue of calibration adjustment in some detail in 1991 in a paper
entitled, Does Calibration Adjustment Optimize Measurement Integrity?. Weiss showed that in
the presence of purely random errors associated with a normal probability density function,
where no statistical difference in the mean value of the distributions exists from one calibration
to the next, that calibration adjustment can degrade instrument performance. Weiss and several
other authors [10-14, 56-60] have drawn upon the popular Deming funnel experiment to
illustrate how "tampering" with or adjusting a calibrated system in a state of statistical control
can introduce additional unwanted variation into a process rather than reduce existing variation.¹
As Weiss demonstrates, if the process exhibits purely random error represented by a normal
probability density function, the effect of this tampering is to increase the variance (σ²) by a factor
of 2. This is equivalent to increasing the standard deviation to √2·σ, or approximately 1.414σ. If
the specification limits were originally set to achieve 95 % confidence (±1.96σ), then this
increased variation from tampering results in an in-tolerance probability (EOPR) of only 83.4 %.
This value becomes important for the interpretation of the results later in this paper in Section 6.
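The arithmetic behind the 83.4 % figure can be checked directly from the normal distribution; the helper below is an illustrative sketch, not code from the paper:

```python
from math import erf, sqrt

def in_tolerance_probability(limit: float, sigma: float = 1.0) -> float:
    """P(|x| <= limit) for a zero-mean normal with standard deviation sigma."""
    return erf(limit / (sigma * sqrt(2.0)))

# Original process: limits at +/-1.96 sigma give ~95 % in-tolerance probability.
print(round(in_tolerance_probability(1.96), 3))            # ~0.95

# Tampering doubles the variance, i.e. sigma grows by sqrt(2) (~1.414),
# so the same +/-1.96 limits now contain only ~83.4 % of the distribution.
print(round(in_tolerance_probability(1.96, sqrt(2)), 3))   # ~0.834
```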
Shah [11] likewise commented in 2007, stating: "Calibration has nothing to do with adjustment.
When a measurement system is adjusted to measure the nominal value whether it is within
tolerance or not... Is this advisable or is it causing more harm than good?... Some adjustments
are justified. Others are not. A calibration technician has to make an instant decision on a
measurement taken... Making a bad decision can lead to quality problems... It is shown that a
stable process with its inherent natural (random) variation should be left on its own."
Abell [13] also touched on this issue in 2003, noting that "one might be inclined to readjust
points to the center of the specification. The temptation to optimize all points by adjusting to
the exact center between the specifications causes two problems. The first is that it might not be
possible to adjust the instrument on a re-calibration to an optimal center value, even with an
expensive repair. Second, a stable instrument that is unlikely to drift will be made worse by
attempts to optimize its performance."
Payne [14] made similar comments in 2005: "There are two reasons adjustment is
not part of the formal definition of calibration: (1) The historical calibration data on
an instrument can be useful when describing the normal variation of the instrument or a
population of substantially identical instruments... (2) a single measurement from that process
is a random sample from the probability density function that describes it. Without other
knowledge, there is no way to know if the sample is within the normal variation limits. The
history gives us that information. If the measurement is within the normal variation and not
outside the specification limits, there is no reason to adjust it. In fact, making an adjustment
could just as likely make it worse as it could make it better. W. Edwards Deming discusses the
problem of overadjustment in chapter 11 of Out of the Crisis."
¹ In the Deming experiment, a stationary funnel is fixed a short distance directly above the center of a target and marbles are
dropped through the funnel onto the target; the resting spot of each marble is marked. Repeated cycles of this will display resting
spots in a random pattern with a natural fixed common-cause variation (σ) around the target's center, following so-called rule
#1 of never adjusting the position of the funnel. Alternatively, if the operator follows rule #2 and futilely attempts to adjust the
position of the funnel after each drop (equal and opposite to the last observed error), the variation of the resting spots increases.
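The funnel experiment is easy to simulate; the sketch below (hypothetical parameters, one axis only) shows rule #2 inflating the spread by about √2:

```python
import random
import statistics

random.seed(1)

def funnel(drops: int, adjust: bool, sigma: float = 1.0) -> list[float]:
    """One axis of the Deming funnel experiment.

    Rule 1: never move the funnel. Rule 2 (adjust=True): after each drop,
    move the funnel equal-and-opposite to the last observed error.
    """
    aim = 0.0
    spots = []
    for _ in range(drops):
        spot = aim + random.gauss(0.0, sigma)  # resting spot of the marble
        spots.append(spot)
        if adjust:
            aim -= spot                        # rule 2: compensate last error
    return spots

rule1 = statistics.stdev(funnel(200_000, adjust=False))
rule2 = statistics.stdev(funnel(200_000, adjust=True))
print(round(rule2 / rule1, 2))   # ~1.41, i.e. sqrt(2)
```

Under rule 2 each resting spot is the difference of two independent errors, so its variance is doubled, which is exactly the √2 inflation of the standard deviation discussed above.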
ISO/TS 16949:2009 [57], which supersedes QS9000 quality management requirements for the
automotive industry, also refers to the phenomenon of over-adjustment in Section 8.1.2 by
requiring, "Basic statistical concepts, such as variation, control (stability), process capability
and over-adjustment, shall be understood throughout the organization."
The MSA Reference Manual [56] also describes over-adjustment, stating:
"...the decision to adjust a manufacturing process is now commonly based on measurement
data. The data, or some statistic calculated from them, are compared with statistical control
limits for the process, and if the comparison indicates that the process is out of statistical
control, then an adjustment of some kind is made. Otherwise, the process is allowed to run
without adjustment... [However] Often manufacturing operations use a single part at the
beginning of the day to verify that the process is targeted. If the part measured is off target, the
process is then adjusted. Later, in some cases another part is measured and again the process
may be adjusted. Dr. Deming referred to this type of measurement and decision-making as
tampering... Over-adjustment of the process has added variation and will continue to do so...
The measurement error just compounds the problem... Other examples of the funnel experiment
are (1) Recalibration of gages based on arbitrary limits, i.e., limits not reflecting the
measurement system's variability (Rule 3). (2) Autocompensation adjusts the process based on
the last part produced (Rule 2)."
Nolan and Provost [58] in 1990 also provide the following: "Decisions are made to adjust
equipment, to calibrate a measurement device, etc. All these decisions must consider the
variation in the appropriate measurements or quality characteristics of the process... The aim of
the adjustment is to bring the quality characteristic closer to the target in the future. ...there are
circumstances in which the adjustments will improve the performance of the process, and there
are circumstances in which the adjustment will result in worse performance than if no
adjustment is made... Continual adjustment of a stable process, that is, one whose output is
dominated by common causes, will increase variation and usually make the performance of the
process worse."
Bucher, in The Quality Calibration Handbook [59] and The Metrology Handbook [60], states:
"With regard to adjusting IM&TE, there are several schools of thought on the issue. On one end
of the spectrum, some (particularly government regulatory agencies) require that an instrument
be adjusted at every calibration, whether or not it is actually required. At the other end of the
spectrum, some hold that any adjustment is tampering with the natural system (from Deming)
and what should be done is simply to record the values and make corrections to measurements.
An intermediate position is to adjust the instrument only if (a) the measurement is outside the
specification limits, (b) the measurement is inside but near the specification limits, where "near"
is defined by the uncertainty of the calibration standards, or (c) a documented history of the
values of the measured parameter shows that the measurement trend is likely to take it out of
specification before the next calibration due date."
The Weiss and Deming models [10] assume purely random variation, for which adjustment is not
only futile but actually detrimental. In such cases, adjustment or tampering increases
the standard deviation (σ) of the process by a factor of √2 (≈1.414), or about 41 %. However, if the
behavior is not purely random, the results can differ. As noted in NCSL RP-1 Appendix G [9]:
"However, if a systematic mean value change mechanism, such as monotonic drift, is introduced
into the model, the result can be quite different. For discussion purposes, modifications of the
model that provide for systematic change mechanisms will be referred to as Weiss-Castrup
models (unpublished)... By experimenting with different combinations of values for drift rate
and extent of attribute fluctuations in a Weiss-Castrup model, it becomes apparent that the
decision to adjust or not adjust depends on whether changes in attribute values are
predominately random or systematic."
Appendix D of NCSL RP-1 describes ten Measurement Reliability Models with #9 being
systematic attribute drift superimposed over random fluctuations (drift model) [9]:
1) Constant out-of-tolerance rate (exponential model).
2) Constant-operating-period out-of-tolerance rate with a superimposed burn-in or wear out
period (Weibull model).
3) System out-of-tolerances resulting from the failure of one or more components, each
characterized by a constant failure rate (mixed exponential model).
4) Out-of-tolerances due to random fluctuations in the MTE attribute (random walk model).
5) Out-of-tolerances due to random attribute fluctuations confined to a restricted domain
around the nominal or design value of the attribute (restricted random-walk model).
6) Out-of-tolerances resulting from an accumulation of stresses occurring at a constant
average rate (modified gamma model).
7) Monotonically increasing or decreasing out-of-tolerance rate (mortality drift model).
8) Out-of-tolerances occurring after a specific interval (warranty model).
9) Systematic attribute drift superimposed over random fluctuations (drift model).
10) Out-of-tolerances occurring on a logarithmic time scale (lognormal model).
This paper investigates behavioral characteristics of instruments that are described by the #9
reliability model above, systematic attribute drift superimposed over random fluctuations (drift
model).
Background information provided in Appendix D of NCSLI RP-1 is highly enlightening with
respect to the Weiss-Castrup Drift model and the decision to adjust or not. Additional
information is also provided by Castrup [54].
A section from Appendix D of NCSLI RP-1 is provided here to facilitate an understanding of the
relationship between systematic and random components of behavior and their influence on both
interval and instrument adjustment decisions, where Φ denotes the normal distribution function
and f(x) the normal probability density function:

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / 2σ²)

where x = random variable, σ = standard deviation, μ = mean.

The drift model reliability function and its partial derivatives with respect to the coefficients are:

R(t) = Φ(a₁ + a₃t) + Φ(a₂ − a₃t) − 1

∂R/∂a₁ = (1/√(2π)) e^(−(a₁ + a₃t)²/2)

∂R/∂a₂ = (1/√(2π)) e^(−(a₂ − a₃t)²/2)

∂R/∂a₃ = (t/√(2π)) [e^(−(a₁ + a₃t)²/2) − e^(−(a₂ − a₃t)²/2)]
If random fluctuation is the dominating mechanism for attribute value changes over time, then the
benefit of periodic adjustment is minimal.
If drift or other systematic change is the dominating mechanism for attribute value changes over
time, then the benefit of periodic adjustment is high.
Obviously, use of the drift model can assist in determining which adjustment practice to employ for a
given attribute. By fitting the drift model to an observed out-of-tolerance time series and evaluating the
coefficient a₃, it can be determined whether the dominant mechanism for attribute value change is
systematic or random. If a₃ is small, then random changes dominate and a renew-if-failed only practice
should be considered. If a₃ is large, then a renew-always practice should perhaps be implemented.
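As a numerical illustration, the drift model reliability can be evaluated directly. The functional form R(t) = Φ(a₁ + a₃t) + Φ(a₂ − a₃t) − 1 and the coefficient values below are assumptions chosen for illustration, not fitted data:

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def drift_model_reliability(t: float, a1: float, a2: float, a3: float) -> float:
    """In-tolerance probability R(t) = Phi(a1 + a3*t) + Phi(a2 - a3*t) - 1."""
    return phi(a1 + a3 * t) + phi(a2 - a3 * t) - 1.0

# Hypothetical coefficients: a1 = a2 = 2 places the tolerance limits at about
# two standard deviations of the random fluctuation, so R(0) is ~0.954.
interval = 12.0  # e.g. months
print(round(drift_model_reliability(interval, 2.0, 2.0, 0.005), 3))  # small a3: random dominates
print(round(drift_model_reliability(interval, 2.0, 2.0, 0.25), 3))   # large a3: drift dominates
```

With a small a₃ the reliability barely decays over the interval; with a large a₃ it collapses, which is exactly the diagnostic the quoted passage describes.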
Copyright 2010 NCSLI. All Rights Reserved. NCSLI Information Manual. Reprinted here under the provisions of the
Permission to Reproduce clause of NCSLI RP-1.
The Weiss-Castrup Drift model described in NCSL RP-1 was primarily intended for the
determination, adjustment, and optimization of calibration intervals in association with Methods
S2 & S3, also called the Binomial Method and the Renewal Time Method, respectively [9].
The Weiss-Castrup drift model is investigated here with a focus on instrument adjustment
thresholds, rather than interval adjustment actions. That is, for a given fixed calibration interval,
how do various discretionary adjustment thresholds (0 % to 100 % of specification), in the
presence of both drift and random variation, affect EOPR reliability? Clearly, if the behavior is
purely random, as in the Weiss and Deming models, an adjust-always policy (0 % adjust
threshold) is detrimental to instrument performance, resulting in decreased EOPR.
However, if the behavior has any element of monotonic drift, as in the Weiss-Castrup Drift
model, an adjustment will be necessary at some point to prevent an eventual OOT condition
resulting from a true attribute bias due to drift. The difficulty manifests during calibration when
attempting to discriminate between attribute bias and a random error. Thus, investigating optimal
adjustment thresholds to maximize EOPR in the presence of random and systematic errors seems
a worthy endeavor. It is also prudent to consider that, even if an optimum adjustment threshold
is determined, there may be other administrative and managerial factors as described in NCSL
RP-1 Appendix G [9] that should be considered when formulating adjustment policies.
The policy of some U.S. Department of Defense military programs and third party OEM
accredited calibration laboratories has been to not routinely, by default, adjust most equipment
unless found out-of-tolerance. For example, "The U.S. Navy has the policy of not adjusting test
equipment that are in tolerance" [15].
However, even under some programs which typically employ an adjust-only-if-OOT policy,
discretionary adjustments are still performed for select equipment types. For example, it is not
uncommon to always assign new calibration factors to microwave power sensors, or sensitivity
values to accelerometers, or coefficients to temperature sensors (e.g. RTDs, PRTs, etc.),
regardless of the as-found condition of the device. In these cases, rather than judge in-tolerance
or out-of-tolerance based on published specifications, these decisions are often rendered based
on the previously assigned uncertainty, applicable to the assigned value. In these applications,
uncertainties must include a reproducibility component in the uncertainty budget that is
applicable over the calibration interval for stated conditions. Such estimates can be attained by
evaluation of historical performance.
3. Empirical Examples: Systematic Drift Superimposed Over Random Fluctuations
The idea that attribute bias can grow or drift over time is ubiquitous; indeed much of the history
of metrology and the impetus for calibration are predicated on this possibility. Examples of such
behavior are often encountered. The distinction between attribute bias arising from drift (or
otherwise) and a random error is sometimes only discernible from the analysis of historical
data. Monotonic drift can be estimated using linear regression models. Such is the case with 10 V
DC zener voltage references. Calibration of these devices must be performed via comparison to
other characterized zeners, standard cells, or, in the most accurate cases, Josephson voltage
measurement systems. Due to the inherently low drift characteristics of commercial zener
references, it would not be possible to adequately detect or resolve drift without a measurement
system exhibiting high resolution, low noise, and zero (or well-characterized/compensated) drift.
The data represented in Figure 1 was acquired with a Josephson voltage measurement system.
The noise or variation observed in the data is primarily due to the zener under test and not to
the measurement standard, while all of the observed drift is attributable to the zener and none to
the measurement standard [16, 17]. It may be noted that the fluctuations about the predicted drift
line are not purely random in nature; they are pseudo-random.
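A linear-regression drift estimate of the kind described above can be sketched with ordinary least squares; the drift rate and noise level below are invented for illustration, not the zener data of Figure 1:

```python
import random

random.seed(3)

# Hypothetical zener history: ~0.04 ppm/day systematic drift plus 0.2 ppm of
# pseudo-random day-to-day fluctuation about the drift line, sampled weekly.
days = list(range(0, 365, 7))
error_ppm = [0.04 * d + random.gauss(0.0, 0.2) for d in days]

# Ordinary least-squares fit of error = intercept + slope * day.
n = len(days)
mean_x = sum(days) / n
mean_y = sum(error_ppm) / n
sxx = sum((x - mean_x) ** 2 for x in days)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, error_ppm))
slope = sxy / sxx                 # estimated drift rate, ppm/day
intercept = mean_y - slope * mean_x

print(round(slope, 3))   # recovers a value close to the true 0.04 ppm/day
```

The fitted slope is the drift-rate estimate; comparing its magnitude against the residual scatter is the same systematic-versus-random judgment the drift model formalizes.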
Figure 3. Regression of calibration data with adjustments mathematically removed (R2 = 0.96)
However, in many cases of general purpose TMDE, monotonic drift may be more difficult to
detect due to domination by more random behavior or even special-cause variation, where
instruments apparently step out-of-tolerance rather than drift in a predictable manner.
Such an example, along with the regression analysis, is shown in Figures 4 and 5. In such cases, a
model with random fluctuation superimposed on monotonic drift may not be the best model. One
of the other models proposed in NCSLI RP-1 may be more appropriate for such an instrument.
Figure 5. Regression w/ relatively low R2, indicating significant behavior not explained by drift
Figure 6. Measurement variation during calibration, compared with variation over cal interval.
Like the Weiss example and Deming funnel (rule #2), the model presented in this paper will
incorrectly assume the observed UUT error of +60 % shown in Figure 6 is attribute bias, even
under purely random behavior. Such an erroneous assumption will result in a calibration
adjustment magnitude of -60 % in a futile effort to correct for the observed random error. Like
Weiss and Deming, the correct assumption under purely random behavior is that the +60 % error
is common-cause and, if left undisturbed, will soon fluctuate and take on some other random
error represented by the UUT distribution. If this assumption is valid, the correct action would be
to do nothing and not adjust. The model presented here attempts to replicate the actions of the
calibrating technician, who does not have knowledge of the magnitudes of the individual
systematic attribute bias vs. random behavior; adjustments are made based only on the observed
error at the time of calibration, which is comprised of both bias from drift and random error.
But this decision can only confidently be made if a priori knowledge of the UUT error
distribution over the course of the calibration interval is known. In many cases, this distribution
is not readily available and discretionary calibration adjustments are made with the assumption
that all of the observed error is an actual attribute bias which will remain (or possibly grow)
unless an adjustment is performed. In an ideal case, the calibration technician would be able to
discern a short-term random error from an actual long-term attribute bias through examination
of historical data. At the time of calibration however, the two types of errors are often
inextricably combined into the observed error, whether obtained from a single reading or
several averaged measurements over a short period of time. The attribute bias is somewhat
hidden in the presence of random error. This is the behavior that is modeled herein.
4.2 Assumption (#3): 95 % Containment Specifications; Selection of Drift vs. Random
This is perhaps the most significant and sweeping assumption used in the model presented here.
The rationale used herein assumes that specifications are generally intended to adequately
accommodate or contain the majority of errors that an instrument might exhibit, with relatively
high confidence (e.g. 95 %). As such, the magnitudes of drift and random variability are selected
as complementary to one another and modeled under this assumption. This greatly restricts the
domain of possible instrument behavior investigated here. Instruments with drift and random
variation, which are both far better (lower) than their specifications might imply, are not modeled
here. Rationale for the assumption and selection of the particular domain of instrument behavior
investigated in this paper is provided here.
As stated in Section 5.4 of NASA HDBK-8739.19-2 [18]: "In general, manufacturer
specifications are intended to convey tolerance limits that are expected to contain a given
performance parameter or attribute with some level of confidence under baseline conditions
Performance parameters and attributes such as nonlinearity, repeatability, hysteresis,
resolution, noise, thermal stability and zero shift are considered to be random variables that
follow probability distributions that relate the frequency of occurrence of values to the values
themselves. Therefore, the establishment of tolerance limits should be tied directly to the
probability that a performance parameter or attribute will lie within these limits
The selection of applicable probability distributions depends on the individual performance
parameter or attribute and are often determined from test data obtained for a sample of articles
or items selected from the production population. The sample statistics are used to infer
information about the underlying parameter population distribution for the produced items. This
population distribution represents the item to item variation of the given parameter. The
performance parameter or attribute of an individual item may vary from the population mean.
However, the majority of the produced items should have parameter mean values that are very
close to the population mean. Accordingly, a central tendency exists that can be described by the
normal distribution
Baseline performance specifications are often established from data obtained from the testing of
a sample of items selected from the production population. Since the test results are applied to
the entire population of produced items, the tolerance limits should be established to ensure that
a large percentage of the items within the population will perform as specified performance
parameter distributions are established by testing a selected sample of the production
population. Since the test results are applied to the entire population of a given parameter, limits
are developed to ensure that a large percentage of the population will perform as specified.
Consequently, the parameter specifications are confidence limits with associated confidence
levels."
Accuracy specifications are of little benefit if they cannot be relied upon with reasonably high
confidence. Manufacturers sometimes publish specifications at both 95 % and 99 % confidence
levels [19]. After many calibration cycles, EOPR is then an empirical estimate of that
confidence; i.e. EOPR provides a measure or assessment of the probability for an instrument to
comply with its specifications at the end of its calibration interval.
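An empirical EOPR estimate of this kind carries sampling uncertainty of its own; the sketch below scores it with a Wilson interval (an illustrative choice of method, with invented counts, not data from the paper):

```python
from math import sqrt

def eopr_with_wilson_interval(in_tol: int, total: int, z: float = 1.96):
    """Observed EOPR and a ~95 % Wilson score interval for the underlying
    in-tolerance probability (illustrative sketch)."""
    p = in_tol / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, center - half, center + half

# Hypothetical history: 46 of 50 calibrations found in tolerance.
p, lo, hi = eopr_with_wilson_interval(46, 50)
print(round(p, 2), round(lo, 2), round(hi, 2))
```

Even 50 calibration cycles leave the underlying in-tolerance probability known only to within roughly ±8 %, which is worth remembering when comparing observed EOPR against a reliability target.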
However, the intent and conditions of specifications and any assumed confidence are subject to a
certain amount of interpretation and inference. Is the confidence level specifically stated or is it
implied? Does the confidence level of the specification apply to a single test point, or to a single
instrument, or to a population of similar instruments?
For example, the published absolute uncertainty specification at a 95 % confidence level for a
Fluke 8508A DMM, at 20 VDC, is 3.2 ppm [19]. The same 20 VDC point has a published
uncertainty of 4.25 ppm expressed at a 99 % confidence level. As manufactured and if properly
used, it might be reasonable for the end-user of this DMM to apply the stated specification at this
particular 20 VDC test point and assume the stated confidence level applies.
However, it can be argued that for multifunction instruments with multiple test points, the actual
confidence level of any individual test point must be much greater than 95 % or even 99 %
confidence if the instrument as-a-whole is expected to meet its specifications with the stated
confidence.
As Deaver has noted [20], "each Fluke Model 5520A Multiproduct Calibrator is tested at 552
points on the production line prior to shipment. If each of the points has a 95 % probability of
being found in tolerance, there would only be a 0.95⁵⁵² = 0.000000000[0]51 % chance of finding
all the points within the specification limits if the points are independent! Even if we estimate
100 independent points (about 2 per range for each function), we would still have only a 0.95¹⁰⁰
= 0.6 % chance of being able to ship the product."
2015 NCSL International Workshop & Symposium
Similar statements have been published by Dobbert [21, 21a]: "A common assumption is
that product specifications describe 95 % of the population of product items [emphasis added].
From the mean, μ, and standard deviation, σ, an interval of [μ − 2σ, μ + 2σ] contains
approximately 95 % of the population. However, when manufacturers set product specifications,
the test line limit is often set wider than 2σ from the population mean..."
For choosing the tolerance interval probability, a generally accepted minimum value is 95%.
However, manufacturers may choose a probability other than 95% for different reasons.
Consider again a multi-parameter product. Manufacturers wish to have high yields for the entire
product so that the yield considering all parameters meets the respective test line limits. If the
product parameters are statistically independent, the overall yield, in this case, is the product of
the probability for each parameter. For a product with just three independent parameters, each
with a test limit intended to give 95 % probability, the product would only have a (0.95)³ or
85.7 % chance of meeting all test line limits, which is perhaps unacceptable to the manufacturer.
For this reason, manufacturers select tolerance interval probabilities greater than 95% so that
the overall probability is acceptable.
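The yield arithmetic quoted above is a straightforward product of independent probabilities and can be checked directly (a minimal sketch; the function name is illustrative, not from the cited works):

```python
def overall_yield(p_point, n_points):
    """Probability that all n statistically independent test points
    are simultaneously found within their test line limits."""
    return p_point ** n_points

# Dobbert's three-parameter product: (0.95)^3 ~ 85.7 %
# Deaver's estimates: 0.95^100 ~ 0.6 %, and 0.95^552 is vanishingly small
```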
When discussing drift, Dobbert also notes, "Stress due to environmental change, as well as
everyday use, transport, aging and other factors may induce small changes in performance that
accumulate over time. In other words, products drift. The effect of drift is that from the time of
manufacture to the end of the initial calibration interval, it is likely that performance has shifted.
... a population of product items also experiences a shift in the mean, a change in the standard
deviation, or both, due to the mechanisms associated with drift...
To ensure products meet specification over the initial calibration interval, manufacturers may
include an additional guard band between the test line limit and the specification... In the
simplest case, the total guard band between the test line limit and the specifications is the sum of
the individual guard band components for environmental factors, drift, measurement uncertainty
and any other required component. For example, GB_total = GB_environment + GB_drift +
GB_uncertainty + GB_other gives what is often the initial specification for a product. For the
final specification, manufacturers must consider manufacturing costs, market demands and
competing product performance."
When discussing manufacturers' specifications propagated into uncertainty analyses, Dobbert
additionally notes, "The GUM provides guidance for evaluation of standard uncertainty and
specifically includes manufacturers' specifications as a source of information for a Type-B
estimate... To evaluate a Type-B uncertainty, the GUM gives specific advice when an
uncertainty is quoted at a given level of confidence. In this instance, an assumption can be made
that a Gaussian distribution was used to determine the quoted uncertainty. The standard
uncertainty can then be determined by dividing by the appropriate factor given the stated level of
confidence. Various manufacturers state a level of confidence for product specifications and
applying this GUM advice to product specifications quoted at a level of confidence is common
and accepted by various accreditation bodies."
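The GUM advice just quoted can be sketched numerically. The snippet below (an illustrative sketch, not from the cited works) divides a specification quoted at a confidence level by the corresponding two-sided Gaussian coverage factor; applied to the 8508A figures quoted earlier, 3.2 ppm at 95 % and 4.25 ppm at 99 % reduce to nearly the same standard uncertainty, consistent with a single underlying Gaussian:

```python
from statistics import NormalDist

def k_factor(confidence):
    """Two-sided Gaussian coverage factor for a stated level of confidence,
    e.g. ~1.960 for 95 %, ~2.576 for 99 %."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

def type_b_standard_uncertainty(spec, confidence):
    """GUM Type-B evaluation: divide a specification quoted at a given
    level of confidence by the appropriate coverage factor."""
    return spec / k_factor(confidence)

# 3.2 ppm / 1.960 ~ 1.63 ppm ; 4.25 ppm / 2.576 ~ 1.65 ppm
```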
The assumption, used in the model investigated by this paper, is that a specification represents
95 % containment probability of errors for a given test point; thus, the magnitude and proportion
of drift and random components are modeled accordingly (see Section 5). This may be a
significant assumption and highly conservative, especially where actual instrument performance
at a given test point exhibits systematic drift (bias) and random error components much lower
than represented by the specifications. For example, the domain of performance for instruments
displaying drift (μ) of only 10 % of specification per interval and, at the same time, a random
component (σ) of only 20 % of specification is not modeled here.
However, a great many instruments may be well capable of performing at such levels, i.e.
considerably better than their specifications would imply. This is especially true if one assumes
that the manufacturer has built significant margins or guardbands into the specifications and/or
that the confidence level of specifications is intended to represent an entire population of
instruments, or one instrument as-a-whole, rather than a single test point. Investigations of such
domains of behavior, and the effect on EOPR of various adjustment thresholds under such
improved instrument performance, may be highly insightful and are deferred to future
explorations². Moreover, models where random variation (σ) itself increases with time (such as
random-walk models) would be useful, with or without a drift component. Such a model, even in
the absence of monotonic drift, exhibits a time-dependent mechanism for transitioning to OOT³.
4.3 Assumption (#9): Mandatory Adjustment of OOT Conditions is Required
In practice, calibration laboratories, which are charged with verification as part of the calibration
process, are required to perform an adjustment of the UUT if it exceeds the allowable tolerance(s)
(>100 %) defined by the agreed-upon specifications. It is not generally acceptable to return an
item to the end-user as calibrated, while exhibiting an observed OOT condition.
However, in a Weiss or Deming model where fluctuations are purely random, this would appear
the correct course of action. The OOT condition, like the in-tolerance condition, should not be
adjusted; it should be allowed to remain with the assumption that it will soon decrease and take
on some other random value which will likely be contained within the specification limits. In this
regard, there is nothing special about the OOT condition. It is simply part of the normal
common-cause random variation that will inevitably, albeit rather infrequently (e.g. 5 %), fall
outside of specification limits which are intended to represent 95 % confidence or other
containment probability. Appendix G of NCSLI RP-1 perhaps best describes this as a logical
predicament when discussing non-adjustment of items as follows:
If we can convince ourselves that adjustment of in-tolerance attributes should not be made, how
then to convince ourselves that adjustment of out-of-tolerance attributes is somehow beneficial?
For instance, if we conclude that attribute fluctuations are random, what is the point of adjusting
attributes at all? What is special about attribute values that cross over a completely arbitrary
line called a tolerance limit? Does traversing this line transform them into variables that can be
controlled systematically? Obviously not.
More on the topic of non-adjustment of OOT conditions is presented later in Section 6.
² The author thanks Jonathan Harben of Keysight Technologies for these astute suggestions.
³ The author thanks Dr. Howard Castrup of Integrated Sciences Group for this valuable observation.
The model presented herein concedes to the conventional industry practice which mandates
adjustment of items which are observed to be out-of-tolerance. Where the observed error is
predominately a long-term attribute bias, resulting from systematic monotonic drift or
otherwise, adjustment is a beneficial action. Such attribute bias is likely to remain or possibly
grow larger if left unadjusted. However, where the observed error resulted predominately from a
short-term random event, adjustment will be the incorrect decision. Like the calibration
technician, this model assumes (correctly or incorrectly) that all observed as-received errors
represent systematic attribute bias; adjustment actions will be implemented according to the
adjustment threshold parameter set for the model (0 % to 100 % of specification). In this sense,
the model feigns ignorance of the constituent proportion of random to attribute bias during
adjustment actions but, in actuality, is privy to the amount of attribute bias at all times in the
simulation. For investigational purposes, adjustment thresholds >100 % of specification are
briefly discussed, although they are believed unlikely to find application in most calibration
laboratories.
5. Modeling and Selection of Magnitude for Drift and Random Variation
The illustration in Figure 7 represents the general concept of monotonic drift superimposed on
constant random variation.
Figure 7 represents random variation as a normal probability distribution with constant width
(σ = constant). However, if the specification limits are intended to provide a containment
probability of 95 % as discussed in Section 4.2, then any allowable drift must result in a
commensurate reduction in the amount of allowable random variation in order to still provide a
95 % confidence. In the model used herein, the amount of drift is first selected as a percentage
(0 % to 100 %) of the allowable specification over one interval. This will result in a systematic
drift-induced attribute bias at the end of one interval equal to the amount of specified drift. OOT
incidents will tend towards the direction of drift; e.g. for a positive drift allowance, OOT
conditions will predominately be found exceeding the upper specification limit in only one tail of
the distribution. The resulting drift, after one interval, forms the mean (μ) of the normally
distributed random component.
Since the intent of the accuracy specification is assumed to represent a 95 % containment
probability for the error, the remaining portion of the specification is then modeled as a normally
distributed random component with a standard deviation (σ) selected to still provide 95 %
containment (see Table 1 and Figure 8). This complementary aspect of these two components is
necessary to provide the desired containment probability. As discussed in Section 4.2,
specifications are often, directly or implied, provided by the OEM with an allowance for drift
designed into them and provided at a relatively high confidence level. This is the basis for the
choice of magnitudes for the model used here. As the drift component dominates and approaches
the 100 % specification limit, the random component approaches zero. That is, as the systematic
drift (μ) increases, the random variation (σ) decreases, as shown in Figure 8.
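The complementary σ values tabulated in Table 1 below can be reproduced numerically: for each drift value μ, bisect for the σ that keeps 95 % of a normal distribution centered at μ inside the ±specification limits (a sketch under the paper's stated assumptions; function names are illustrative):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigma_for_drift(drift, spec=100.0, containment=0.95):
    """Bisect for the random sigma (% of spec) that, combined with a mean
    shifted to `drift` (% of spec), still yields 95 % containment
    within the +/- spec limits."""
    lo, hi = 1e-9, 10.0 * spec
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        p = norm_cdf((spec - drift) / mid) - norm_cdf((-spec - drift) / mid)
        if p > containment:
            lo = mid          # containment still too high: sigma may grow
        else:
            hi = mid
    return 0.5 * (lo + hi)

def oot_tail_probs(drift, sigma, spec=100.0):
    """Left- and right-tail OOT probabilities of the shifted normal."""
    left = norm_cdf((-spec - drift) / sigma)
    right = 1.0 - norm_cdf((spec - drift) / sigma)
    return left, right
```

For example, zero drift gives σ ≈ 51.021 % of specification (100/1.960), and a 25 % drift gives σ ≈ 44.874 % with tail OOT probabilities of ≈ 0.267 % and ≈ 4.733 %, matching Table 1.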
Table 1. Magnitude of drift (μ) and random (σ) components, modeled to maintain 95 % in-tolerance confidence.
(σ) is given as a percentage of the specification. (μ) is given as a percentage of the specification per interval.

Drift    Random     Ratio    Left Tail   Right Tail
(μ)      (σ)        (μ/σ)    OOT Prob.   OOT Prob.
  0 %    51.021 %   0.000    2.500 %     2.500 %
  1 %    51.011 %   0.020    2.385 %     2.614 %
  2 %    50.981 %   0.039    2.271 %     2.729 %
  3 %    50.933 %   0.059    2.158 %     2.843 %
  4 %    50.864 %   0.079    2.044 %     2.956 %
  5 %    50.776 %   0.098    1.933 %     3.068 %
  6 %    50.668 %   0.118    1.822 %     3.178 %
  7 %    50.539 %   0.139    1.712 %     3.287 %
  8 %    50.392 %   0.159    1.605 %     3.395 %
  9 %    50.224 %   0.179    1.499 %     3.500 %
 10 %    50.038 %   0.200    1.396 %     3.604 %
 11 %    49.831 %   0.221    1.296 %     3.705 %
 12 %    49.603 %   0.242    1.198 %     3.803 %
 13 %    49.356 %   0.263    1.103 %     3.898 %
 14 %    49.088 %   0.285    1.011 %     3.989 %
 15 %    48.801 %   0.307    0.922 %     4.078 %
 16 %    48.495 %   0.330    0.838 %     4.163 %
 17 %    48.168 %   0.353    0.757 %     4.243 %
 18 %    47.821 %   0.376    0.680 %     4.320 %
 19 %    47.456 %   0.400    0.608 %     4.393 %
 20 %    47.071 %   0.425    0.540 %     4.461 %
 21 %    46.666 %   0.450    0.476 %     4.524 %
 22 %    46.244 %   0.476    0.417 %     4.583 %
 23 %    45.805 %   0.502    0.362 %     4.638 %
 24 %    45.347 %   0.529    0.312 %     4.687 %
 25 %    44.874 %   0.557    0.267 %     4.733 %
 26 %    44.386 %   0.586    0.226 %     4.774 %
 27 %    43.881 %   0.615    0.190 %     4.810 %
 28 %    43.363 %   0.646    0.158 %     4.842 %
 29 %    42.834 %   0.677    0.130 %     4.870 %
 30 %    42.291 %   0.709    0.106 %     4.894 %
 31 %    41.739 %   0.743    0.085 %     4.915 %
 32 %    41.177 %   0.777    0.067 %     4.933 %
 33 %    40.606 %   0.813    0.053 %     4.947 %
 34 %    40.029 %   0.849    0.041 %     4.960 %
 35 %    39.444 %   0.887    0.031 %     4.969 %
 36 %    38.856 %   0.926    0.023 %     4.977 %
 37 %    38.263 %   0.967    0.017 %     4.983 %
 38 %    37.666 %   1.01     0.012 %     4.988 %
 39 %    37.066 %   1.05     0.009 %     4.991 %
 40 %    36.464 %   1.10     0.006 %     4.994 %
 41 %    35.861 %   1.14     0.004 %     4.996 %
 42 %    35.256 %   1.19     0.003 %     4.998 %
 43 %    34.650 %   1.24     0.002 %     4.998 %
 44 %    34.043 %   1.29     0.001 %     4.999 %
 45 %    33.435 %   1.35     0.001 %     4.999 %
 46 %    32.828 %   1.40     0.000 %     4.999 %
 47 %    32.220 %   1.46     0.000 %     4.999 %
 48 %    31.613 %   1.52     0.000 %     5.000 %
 49 %    31.005 %   1.58     0.000 %     5.000 %
 50 %    30.398 %   1.64     0.000 %     5.000 %

For drift values of 51 % and above, the tail probabilities were not tabulated
(left tail ≈ 0.000 %, right tail ≈ 5.000 %).

Drift    Random     Ratio
(μ)      (σ)        (μ/σ)
 51 %    29.790 %   1.71
 52 %    29.181 %   1.78
 53 %    28.573 %   1.85
 54 %    27.966 %   1.93
 55 %    27.358 %   2.01
 56 %    26.749 %   2.09
 57 %    26.142 %   2.18
 58 %    25.534 %   2.27
 59 %    24.926 %   2.37
 60 %    24.318 %   2.47
 61 %    23.710 %   2.57
 62 %    23.102 %   2.68
 63 %    22.494 %   2.80
 64 %    21.886 %   2.92
 65 %    21.278 %   3.05
 66 %    20.670 %   3.19
 67 %    20.062 %   3.34
 68 %    19.454 %   3.50
 69 %    18.846 %   3.66
 70 %    18.238 %   3.84
 71 %    17.630 %   4.03
 72 %    17.022 %   4.23
 73 %    16.415 %   4.45
 74 %    15.807 %   4.68
 75 %    15.199 %   4.93
 76 %    14.591 %   5.21
 77 %    13.983 %   5.51
 78 %    13.375 %   5.83
 79 %    12.767 %   6.19
 80 %    12.159 %   6.58
 81 %    11.551 %   7.01
 82 %    10.943 %   7.49
 83 %    10.335 %   8.03
 84 %     9.727 %   8.64
 85 %     9.119 %   9.32
 86 %     8.511 %   10.1
 87 %     7.904 %   11.0
 88 %     7.296 %   12.1
 89 %     6.687 %   13.3
 90 %     6.080 %   14.8
 91 %     5.472 %   16.6
 92 %     4.864 %   18.9
 93 %     4.256 %   21.9
 94 %     3.648 %   25.8
 95 %     3.040 %   31.3
 96 %     2.432 %   39.5
 97 %     1.824 %   53.2
 98 %     1.216 %   80.6
 99 %     0.608 %   163
100 %     0.000 %   N/A
[Figure 9. Flow diagram of the simulation model for one calibration interval. The as-received
error (eOBS) is first checked against the tolerance (In Tol? eOBS < Tolerance?) and against the
adjustment threshold (Adjust? eOBS > Adjustment Threshold?). If an adjustment is performed, it
is equal-and-opposite to the observed error (Adjustment = -eOBS; Result: eOBS = 0), leaving an
as-left bias equal to the negative of the previous random component (-eRAND (i-1)). Otherwise,
the as-left bias carries over the previous cumulative systematic bias (eBIAS (i-1) + eDRIFT (i-1)).
The end-of-period error is then eEOP = eBIAS + eDRIFT + eRAND.]
eOBS = Error Observed for the UUT, as-received. It is equal to the End-Of-Period error for the
previous calibration interval (eEOP (i-1) ). Only a portion of eOBS is due to systematic error (eBIAS (i-1) +
eDRIFT (i-1)). However, any adjustments are performed equal-and-opposite to the whole of eOBS,
which includes random error (eRAND (i-1)) in addition to systematic error (eBIAS (i-1) + eDRIFT (i-1)).
eBIAS = UUT attribute bias, as-left. If no adjustment has been made, eBIAS remains the same as the
sum of the systematic errors at the end of the previous calibration interval (eBIAS (i-1) + eDRIFT (i-1)). If
an adjustment is made, eBIAS is equal to the negative of the previous random error (-eRAND (i-1)). After
adjustment, eBIAS is zero only if the random error during the previous cal interval (eRAND (i-1)) was
zero (unlikely). Adjustment actions will always negate previously accumulated attribute bias,
but will also result in attribute bias of their own, due to an overcompensated adjustment.
eDRIFT = Error of UUT attributable to monotonic drift. If no adjustment is made, this systematic
drift error carries over or accumulates from one calibration interval to the next. For the model,
eDRIFT is specified as a percentage of the allowable tolerance or accuracy specification. The
remainder of the specification is then allocated to eRAND as (100 % - Drift %).
eRAND = Error of UUT attributable to random behavior. A random number generator is used to
select eRAND from a normal Gaussian distribution. Ideally, no adjustment should be made to
compensate for this component. This is common-cause variation with an assumed period
significantly longer than the observation period during calibration. If all variation is random,
adjusting is equivalent to tampering with a system which may otherwise be in a state of
statistical control. It is analogous to moving the funnel in the Deming experiment.
eEOP = Error of UUT at End of Period (includes attribute bias, plus drift, plus random error).
6. Results
The results in Figures 10A and 10B were rendered via the Monte Carlo method to visually
investigate aspects of the Weiss-Castrup drift model with regard to adjustment thresholds. The
model in Figure 9 is repeated for 100 000 iterations and the number of Out-Of-Tolerance
instances for eOBS is tallied over the 10⁵ cycles. The End-of-Period-Reliability is then computed as
EOPR = (10⁵ − #OOTs)/10⁵. This process is repeated ten times with the average taken to arrive at
a final simulated EOPR output, applicable to a specifically chosen pair of values in the model,
i.e. (1) the amount of monotonic drift and (2) the adjustment threshold. A 101 × 101 matrix of
EOPR values is then generated by looping the process in +1 % increments from 0 % to 100 %
for both the monotonic drift variable and the adjustment threshold variable. In total, ~10¹⁰ Monte
Carlo iterations are used in the generation of the matrix. This requires considerable
computational brute-force and consumed approximately 43 hours of CPU time running under
MS Windows 7 in Excel 2010 using an Intel Core™ i5 4300 CPU clocked at 2.6 GHz. See
Appendix B for a discussion of using Excel for Monte Carlo methods.
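One cell of this EOPR matrix can be sketched compactly in Python (an illustrative sketch of the per-interval recursion described above, not the author's Excel implementation; note that with thresholds above 100 % this sketch also forgoes the mandatory OOT adjustment, matching the exploratory scenario discussed later):

```python
import random

def simulate_eopr(drift_pct, threshold_pct, sigma_pct,
                  n_intervals=100_000, seed=1):
    """Monte Carlo sketch of one cell of the EOPR matrix.

    Per interval: e_obs = e_bias + e_drift + e_rand. If |e_obs| exceeds
    the adjustment threshold, an equal-and-opposite adjustment is made,
    leaving an as-left bias of -e_rand (the over-compensation); otherwise
    the systematic drift accumulates into the bias. All values are in
    percent of specification.
    """
    rng = random.Random(seed)
    e_bias = 0.0
    oot_count = 0
    for _ in range(n_intervals):
        e_rand = rng.gauss(0.0, sigma_pct)
        e_obs = e_bias + drift_pct + e_rand   # end-of-period / as-received error
        if abs(e_obs) > 100.0:                # observed OOT incident
            oot_count += 1
        if abs(e_obs) > threshold_pct:        # adjustment performed
            e_bias = -e_rand                  # residual bias from over-compensation
        else:
            e_bias += drift_pct               # unadjusted drift carries over
    return 1.0 - oot_count / n_intervals
```

With zero drift and σ = 51.021 % (Table 1), an always-adjust policy (threshold 0 %) reproduces the ~83.4 % EOPR of the Weiss/Deming tampering case, while a very large threshold recovers the ~95 % never-adjust containment.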
The resulting multivariate matrix can then be plotted as a three dimensional surface plot (Figures
10A & 10B), with the EOPR values displayed on the vertical z-axis. The x-axis represents the
monotonic drift rate and the y-axis represents the adjustment threshold, from 0 % to ~100 %
each. This provides insight into the effects that these variables impart to EOPR, which is
arguably the most important quality metric for many calibration and metrology organizations.
Other important quality metrics, such as Test Uncertainty Ratio (TUR) and the Probability of
False Accept (PFA), are inextricably interrelated to the observed EOPR [22, 23].
Figure 10A. 3D surface plot of EOPR as a function of adjustment threshold and drift
Figure 10B. 3D surface plot of EOPR as a function of adjustment threshold and drift
It is important to bear in mind the nature of the x-axis, representing drift in Figures 10A and 10B.
As the amount of drift increases, the random behavior decreases, as assumed by this
particular model (see Table 1). Other modeling can be performed with different parametric
assumptions, e.g. where the random variation is held constant (or grows larger) in the presence of
increasing drift. Still other assumptions, such as zero drift and increasing random variation, e.g.
random-walk models, could be modeled. Such investigations would provide additional insight.
It should also be noted that here, the x-axis merely approaches 100 % drift (zero random error).
When drift is exactly 100 % of specification with zero random error, all adjustment thresholds
≤ 100 % result in 100 % EOPR. In that case, adjustments are always performed and they are
always perfect due to the absence of random error (assuming infinite TUR; see assumption #5
in Section 4).
Many implications exist from the resulting model in Figure 10A and 10B for the stated
assumptions. Perhaps the most significant commonality in all instances is that, as the calibration
adjustment threshold increases from 0 % to 100 % of specification, the EOPR remains constant
or decreases in all cases; it never increases. This is further illustrated in Figure 11.
Figure 11. EOPR as a function of adjustment threshold for various levels of drift
In Figure 11, note that for the case of purely random variation with zero drift (green line), the
EOPR is constant at 83.4 %, just as the Weiss and Deming model would predict when
adjustments are always made (i.e. adjustment threshold of 0 % of specification). However, it is
interesting to note that this 83.4 % EOPR does not improve as the adjustment threshold is
increased from 0 % (always adjust) towards 100 % of specification (adjust less frequently).
Why does an increase in EOPR (reduction in variability) not result, in this purely random case,
as the adjustment threshold increases from 0 % to 100 % (i.e. less frequent adjustments)? The
answer to this question can be elucidated if the scale of the adjustment threshold and y-axis are
extended beyond the 100 % of specification limit (OOT point). With the model constrained to a
maximum of 100 % adjustment threshold in the purely random case, adjustments will still be
made for all observed OOT conditions. Even though these adjustments occur less frequently than
the always-adjust scenario (0 % adjustment threshold), the magnitude of these less frequent
adjustments or tampering is always quite large. For purely random systems, these large but less
frequent adjustments for observed OOT conditions ultimately result in the same outcome as the
Weiss and Deming model predict; i.e. they lead to the same increased variability (variance
doubled from σ² to 2σ², or standard deviation increased from σ to √2·σ) and resulting lower
EOPR (83.4 %), just as if adjustment or tampering was performed every time.
If the adjustment threshold is increased to 500 % of specification (or more), and the simulation is
run again, a decrease in variability (from √2·σ to σ) and resulting increase in EOPR (from 83.4 %
to 95 %) is indeed observed. However, the transition region where this phenomenon occurs is not
well-behaved (see Figure 12). That is, as the adjustment threshold is raised above 100 % of
specification, fewer and fewer adjustments are ever made. The probability of adjustment
becomes exceedingly low. However, when one of these very rare events does occur, triggering
an adjustment (after many thousands of iterations of the Monte Carlo simulation), the effect is
quite significant. Since it was presumed to be a random event, no adjustment should have been
made (even at 150 %, 200 %, 300 % of specification, or more). Adjusting such a large random
error imparts an attribute bias equal in magnitude but opposite in sign.
Figure 12. Monte Carlo modeled behavior for random errors w/ adjust thresholds >100 % of spec
If the Monte Carlo simulations are extended to include adjustment thresholds far above 100 % of
specification (>OOT), the EOPR behavior becomes somewhat erratic between 150 % and 270 %
of specification. It ultimately settles at the 95 % EOPR, just as if no adjustments were ever made,
because no adjustments are essentially ever made when the adjustment threshold is so large. The
repeatability of the Monte Carlo process is also poor in this transition region (even with 10⁶
iterations) because the results of the simulation are highly sensitive to very improbable events.
After the adjustment threshold extends beyond ~270 % of specification (~5.5σ), adjustment
actions become so rare as to approach the never adjust scenario of the Deming funnel (rule #1)
where the variation is lowest. Under these circumstances, the EOPR settles at the original 95 %
containment probability of the purely random variation with respect to the ±1.96σ specification
limits.
This scenario will likely find little application in calibration laboratories. One would have to be
willing to not adjust instruments with observed errors >>100 % of specification (highly OOT).
The rationale for such decision would be to attribute all errors (regardless of how large) as purely
random events that would not remain if simply left alone and not adjusted. In reality, such large
errors may be much more likely to be true attribute bias resulting from special-cause variation
such as misuse, over-ranging, rough handling, etc. Analysis of historical data is of great benefit
when attempting to characterize such errors.
NASA Kennedy Space Center (KNPR 8730.1, Rev. Basic-1; 2003 to 2009, Obsolete):
"At KSC, calibration intervals are adjusted to achieve an EOPR range of 0.85 to 0.95." [24]
8. Non-Adjustable Instruments
It should be noted that, in the presence of any amount of monotonic drift regardless of how
small, an adjustment will eventually have to be made or the attribute bias will ultimately exceed
the allowable specification. Indeed, the very practice of shortening an interval to increase EOPR
is somewhat predicated on some form of time-dependent mechanism increasing the magnitude of
possible errors, along with the ability to adjust (reduce) the attribute bias to or near zero.
For non-adjustable instruments, EOPR cannot generally be increased by shortening a calibration
interval via the same mechanism applicable to adjustable instruments. However, shortening the
calibration interval for non-adjustable instruments can still be beneficial in two ways.
1) An increase in EOPR can still result from shortening the calibration interval for non-adjustable
instruments which exhibit a relatively small time-dependent mechanism for transitioning to an
OOT condition (e.g. low drift). This is true because more in-tolerance calibrations will be
performed prior to the occurrence of an OOT condition. Once a non-adjustable instrument incurs
its first OOT condition, it cannot be adjusted back into tolerance and has effectively reached the
end of its service life, at which point EOPR =
(#Calibrations - 1) / (#Calibrations). The shorter the interval, the more in-tolerance
calibrations will have been performed and the higher the EOPR will be. After the first OOT
event, the instrument must then be retired from service or the allowable tolerance must be
increased with consent from the end-user or charted values must be manually employed
via a Report of Test or Calibration Certificate. Such action should only be taken if no impact
will result to the application or process for which the instrument is employed.
2) Organizational benefits, other than increased EOPR, can also be realized through shortening
of calibration intervals for non-adjustable instruments. These benefits do not manifest as an
increase in EOPR, but rather in a reduction of the exposure to possible consequences
associated with an out-of-tolerance condition. For example, a working-standard resistor
(calibrated to a tolerance) may not be adjustable. An out-of-tolerance condition may
eventually arise from drift or even special-cause variation (over-power/voltage, mechanical
shock/damage, etc.) Shortening the calibration interval will provide no direct benefit to
EOPR via a reduction in errors through adjustment. However, since any OOT condition will
result in an impact assessment (reverse traceability) for all instruments calibrated by this
OOT resistor, a shorter calibration interval will reduce the amount of impact possible
assessments and risk exposure to product or process, providing benefits of a different nature.
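The EOPR arithmetic in item 1) above can be sketched directly (a trivial but illustrative helper; the assumption is that a fixed drift rate fixes the calendar time to first OOT, so halving the interval doubles the number of in-tolerance calibrations performed before it):

```python
def eopr_after_first_oot(n_calibrations):
    """EOPR of a non-adjustable item retired at its first OOT condition:
    all prior calibrations were found in-tolerance, so
    EOPR = (#Calibrations - 1) / #Calibrations."""
    return (n_calibrations - 1) / n_calibrations

# 10 calibrations before first OOT -> 9/10 = 0.90
# halving the interval (20 calibrations) -> 19/20 = 0.95
```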
9. Conclusions
Discretionary adjustment during calibration of in-tolerance equipment is not mandated by
national and international calibration standards ANSI/Z540.3 and ISO-17025, nor is adjustment
contained within the VIM definition of calibration. A model has been used here in an attempt to
describe the effect of various discretionary adjustment thresholds on in-tolerance instruments,
assuming a specific behavioral mode called the Weiss-Castrup drift model and under very
specific assumptions. These assumptions may not hold for many items of TM&DE. Other
alternative assumptions, where the domain of drift and random behavior simultaneously
comprise only a small percentage of the associated specification, may yield significantly
different results and are worthy of further investigation.
Using Monte Carlo methods, the effect of various discretionary adjustment thresholds on End Of
Period Reliability (EOPR) has been investigated for in-tolerance instruments under these specific
conditions. For the model and assumptions stated, it is shown that discretionary adjustments of
in-tolerance instruments can be beneficial in the presence of monotonic drift superimposed on
random variation. Under these conditions, the non-adjustment benefits of reduced variation
(increased EOPR), posed by the Weiss model and Deming funnel model, do not appear to
manifest between the 0 % and 100 % of specification adjustment thresholds. As the calibration
adjustment threshold increases from 0 % to 100 % of specification, the EOPR remains constant
or decreases in all cases; it never increases. Only after the adjustment threshold far exceeds
100 % of specification and effectively approaches the never-adjust scenario, are these benefits
realized for purely random behavior. Never adjusting items with any significant amount of
monotonic drift is not a viable option, as these instruments will rather quickly transition to an
OOT condition resulting from a true attribute bias due to drift.
The assumptions of the model may be idealized and unrealistic in the empirical world. Moreover,
it may be unlikely that the behavior of any instrument would be entirely restricted to only the
two change mechanisms accommodated by this model or the domain of magnitudes and/or
proportions of drift and random behavior restricted to the values modeled here. Many general
purpose TM&DE instruments may perform considerably better than their specifications would
imply. They may also be impacted by other behavioral characteristics and special cause events,
hindering the use of this model and of linear regression as a prediction technique.
Random walk behavior, where the magnitude of the random variation (σ) itself increases with
time may be more realistic in many cases. Under such random walk models, the probability of
OOT events increases with time, even in the absence of monotonic drift. Much opportunity for
continued investigations and research exists in this regard. However, the assumptions stated
herein, when combined with the Weiss-Castrup drift model, provide a rudimentary working
construct with which to glean useful insight into the effect of various adjustment thresholds for
in-tolerance instruments under a variety of systematic and random errors.
Many programmatic factors must be considered when implementing instrument adjustment
policies or thresholds, above and beyond the exclusive consideration of maximizing EOPR.
Instrument adjustment can increase expense to a company or calibration laboratory in that As-
Received data must be acquired prior to adjustment, and As-Left data must be taken after the
adjustment. The model presented here strives to encourage additional investigation while
providing program managers and metrology professionals with a tool to assist in the
establishment of instrument adjustment policies and to guide possible decision processes. Astute
policy makers will likely use a variety of tools, models, assumptions, and empirical data,
balancing many options and objectives, to achieve the most prudent adjustment policy for a
particular organization.
11. Bibliography
1. ANSI/NCSL Z540.3:2006. Requirements for the Calibration of Measuring and Test
Equipment. American National Standards Institute / NCSL International, 2006.
http://www.ncsli.org/I/i/p/z3/c/a/p/NCSL_International_Z540.3_Standard.aspx?hkey=7de8317116ff-416c-9182-94c8447fb300
2. NCSL Z540.3:2006 Handbook, Handbook for the Application of ANSI/NCSL Z540.3-2006
Requirements for the Calibration of Measuring and Test Equipment. American National
Standards Institute / NCSL International. 2006.
http://www.ncsli.org/I/i/p/zHB/c/a/p/Zhb1.aspx?hkey=572363f0-59e9-4817-8b65-ae6ba5d8ff24
3. ISO/IEC 17025:2005(E). General Requirements for the Competence of Testing and
Calibration Laboratories. International Organization for Standardization / International
Electrotechnical Commission. 2005.
http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39883
4. JCGM 200:2012 (ISO/IEC Guide 99-12:2007). International Vocabulary of Metrology - Basic
and General Concepts and Associated Terms (VIM). Joint Committee for Guides in
Metrology - Working Group 2, 3rd Edition. 2008.
http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2012.pdf
5. U.S. Department of Health and Human Services, Food and Drug Administration, Form 483
Observation #11 Control of Inspection, Measuring, and Test Equipment, Commander S.
Creighton, Consumer Safety Officer. Issued to St. Jude Medical IESD, Sylmar CA. October 17,
2012.
http://www.fda.gov/downloads/aboutfda/centersoffices/officeofglobalregulatoryoperationsandpol
icy/ora/oraelectronicreadingroom/ucm328488.pdf
6. J. Bucher, Measure for Measure - Out of Sync. American Society for Quality (ASQ)
Measurement Quality Division, The Standard, Vol. 27 No. 2, PDF pp 21-22. June 2013.
http://rube.asq.org/measure/2013/05/the-standard-june-2013.pdf
6a. The preceding paper was also published in ASQ Quality Progress, pp 52-53. March 2010.
http://asq.org/quality-progress/2010/03/measure-for-measure/out-of-sync.html
7. J. Bucher, Debunking The Two Great Myths About Calibration: Traceability to NIST: If You
Cannot Adjust, You Cannot Calibrate. Proceedings of the NCSL International Workshop and
Symposium, National Harbor MD. Aug 2011. https://www.ncsli.org/c/f/p11/48.299.pdf
8. J. Bucher. Where Does It Say That? Clearing Up the FDA's Calibration Requirements.
American Society for Quality, Measurement Quality Division, The Standard, Vol. 27 No. 2, PDF
pp 31-32. June 2013. http://rube.asq.org/measure/2013/05/the-standard-june-2013.pdf
8a. The preceding paper was also published in ASQ Quality Progress, pp 50-51. November 2010.
http://asq.org/quality-progress/2010/11/measure-for-measure/where-does-it-say-that.html
9. NCSL RP-1:2010, Recommended Practice: Establishment and Adjustment of Calibration
Intervals, NCSL International, Boulder CO. 2010.
http://www.ncsli.org/I/i/Store/rp/iMIS/Store/rp.aspx?hkey=bf3e3957-f502-484d-9842fa5ef6325073
10. B. Weiss. Does Calibration Adjustment Optimize Measurement Integrity? Proceedings of
the National Conference of Standards Laboratories Workshop and Symposium, Albuquerque
NM. August 1991. http://legacy.library.ucsf.edu/tid/jlw43b00/pdf
11. D. Shah, Deming Funnel Experiment and Calibration Over Adjustment: New Innovation?
American Society for Quality, ASQ World Conference on Quality and Improvement
Proceedings, Vol. 61, Orlando FL. April 2007. http://asq.org/qic/display-item/?item=21074
12. S. Prevette. Dr. Deming's Funnel Experiment. Symphony Technologies Pvt Ltd, Rule 2
Example Periodic Calibrations, Pune India.
www.symphonytech.com/articles/pdfs/spfunnel.pdf
13. D. Abell. Do You Really Need a 17025 Accredited Calibration? Proceedings of the NCSL
International Workshop and Symposium, Tampa FL. August 2003.
14. G. Payne. Measure for Measure: Calibration: What Is It? ASQ Quality Progress, American
Society for Quality. pp 72-76. May 2005.
http://asq.org/quality-progress/2005/05/measure-for-measure/calibration-what-is-it.html
15. D. Jackson. Calibration Intervals - New Models and Techniques. Naval Surface Warfare
Center Corona Division, Proceedings of the Measurement Science Conference, Anaheim CA.
January 2002.
16. C. Hamilton, Y. Tang, Evaluating the Uncertainty of Josephson Voltage Standards.
Metrologia Vol. 36 No. 1, pp 53-58. February 1999.
https://www.researchgate.net/profile/Y_Tang2/publication/231103850_Evaluating_the_uncertainty_of_Jo
sephson_voltage_standards/links/54abe6cf0cf25c4c472fb877.pdf
21a. A version of the preceding paper was also published in NCSLI Measure: The Journal of
Measurement Science. Vol. 5 No. 3, pp 68-73. September 2010.
http://www.keysight.com/upload/cmc_upload/All/Setting_Using_Specifications.pdf
22. P. Reese, J. Harben. Implementing Strategies for Risk Mitigation in the Modern Calibration
Laboratory, Proceedings of the NCSL International Workshop and Symposium, National Harbor
MD. August 2011.
https://www.researchgate.net/profile/Paul_Reese2/publication/258311599_Implementing_Strategies_for_
Risk_Mitigation_In_the_Modern_Calibration_Laboratory/file/e0b49527c29c941b2e.pdf
23. P. Reese, J. Harben. Risk Mitigation Strategies for Compliance Testing. Measure: The
Journal of Measurement Science, NCSL International, Vol. 7 No. 1, pp 38-49. March 2012.
https://www.researchgate.net/profile/Paul_Reese2/publication/258311819_Risk_Mitigation_Strategies_fo
r_Compliance_Testing/file/e0b49527c24ec50664.pdf
24. NASA KNPR 8730.1 (Rev. Basic-1), Kennedy NASA Procedural Requirements, Section 3.3
Calibration Intervals, pp 11. National Aeronautics and Space Administration. March 2003.
25. Navy OPNAV 3960.16A, Navy Test, Measurement, and Diagnostic Equipment (TMDE),
Automatic Test Systems (ATS), and Metrology and Calibration (METCAL), Section 6 Policy,
U.S. Navy, paragraph (o), pp 6. August 2005.
http://doni.daps.dla.mil/Directives/03000%20Naval%20Operations%20and%20Readiness/03900%20Research,%20Development,%20Test%20and%20Evaluation%20Services/3960.16A.pdf
26. J. Albright. Thesis: Reliability Enhancement of the Navy Metrology and Calibration
Program, Naval Postgraduate School, Monterey CA. December 1997.
https://calhoun.nps.edu/bitstream/handle/10945/8906/reliabilityenhan00albr.pdf
27. USAF TO 00-20-14, Technical Manual Air Force Metrology and Calibration Program.
Section 3.4 Calibration Intervals, pp 3-8. Secretary of the United States Air Force, September
2011. www.wpafb.af.mil/shared/media/document/AFD-120724-063.pdf
28. GAO LCD-77-427, B-160682, A Central Manager is Needed to Coordinate the Military
Diagnostic and Calibration Program. Appendix I Different Criteria Used To Establish
Calibration Intervals at Metrology Centers, U.S. General Accounting Office, pp 1-2, May 1977.
http://gao.justia.com/national-aeronautics-and-space-administration/1977/5/a-central-manager-is-neededto-coordinate-the-military-diagnostic-and-calibration-program-lcd-77-427/LCD-77-427-full-report.pdf
29. AR-750-43, Maintenance of Supplies and Equipment Army Test, Measurement, and
Diagnostic Equipment, Chapter 6, Section I Program Objectives and Administration,
Paragraph 6-1a Program Objectives, pp 24. Department of Defense, U.S. Army. Jan 2014.
http://www.apd.army.mil/pdffiles/r750_43.pdf
30. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel
97. Computational Statistics & Data Analysis. Vol. 31 No. 1, pp 27-37. July 1999.
http://users.df.uba.ar/cobelli/LaboratoriosBasicos/excel97.pdf
31. L. Knüsel. On the accuracy of the statistical distributions in Microsoft Excel 97.
Computational Statistics & Data Analysis. Vol. 26 No. 3, pp 375-377. January 1998.
http://www.sciencedirect.com/science/article/pii/S0167947397817562
47a. A long version of the preceding paper (w/software and results from BigCrush TestU01) is
available from NPL with margin notes and additional appendices regarding implementation of
the enhanced 4-cycle Wichmann-Hill PRNG.
http://www.npl.co.uk/science-technology/mathematics-modelling-and-simulation/mathematics-andmodelling-for-metrology/mmm-software-downloads
48. T. Symul, S. Assad, P. Lam. Real Time Demonstration of High Bitrate Quantum Random
Number Generation with Coherent Laser Light. Applied Physics Letters. Vol. 98 No. 23. June
2011. http://arxiv.org/pdf/1107.4438.pdf http://photonics.anu.edu.au/qoptics/Research/qrng.php
49. A. Yee, S. Kondo. 12.1 Trillion Digits of Pi, And We're Out of Disk Space... December 2013.
http://www.numberworld.org/misc_runs/pi-12t/
50. F. Panneton, P. L'Ecuyer, M. Matsumoto. Improved Long-Period Generators Based on
Linear Recurrences Modulo 2. ACM Transactions on Mathematical Software. Vol. 32 No. 1, pp
1-16. March 2006. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf
51. JCGM 101:2008. Evaluation of Measurement Data - Supplement 1 to the Guide to the
Expression of Uncertainty in Measurement - Propagation of Distributions Using a Monte Carlo
Method. Joint Committee for Guides in Metrology. Working-Group 1. First Edition, 2008.
http://www.bipm.org/utils/common/documents/jcgm/JCGM_101_2008_E.pdf
52. A. Steele, R. Douglas. Simplifications from Simulations: Monte Carlo Methods for
Uncertainties. NCSLI Measure: The Journal of Measurement Science. Vol. 1 No. 2, pp 56-68.
June 2006. http://www.ncsli.org/I/mj/dfiles/NCSLI_Measure_2006_June.pdf
52a. A version of the preceding paper was also published in the 2005 Proceedings of the NCSL
International Workshop & Symposium, Washington D.C. August 2005.
53. P. Reese. Personal communications with P. L'Ecuyer & R. Simard via email. April 2015.
54. H. Castrup. Calibration Requirements Analysis System. Proceedings of the 1989 NCSL
Workshop and Symposium, Denver CO. July 1989.
http://www.isgmax.com/articles_papers/ncsl89.pdf
55. NCSLI LM-5. Laboratory Management Publication: Benchmark Survey - 2007. Sponsored
by Boeing Company. NCSL International 182 Benchmarking Programs Committee. Boulder
CO. 2007. http://www.ncsli.org
56. AIAG MSA-4. Measurement Systems Analysis Reference Manual. Automotive Industry
Action Group (AIAG) MSA Work Group. Chrysler Group LLC, Ford Motor Company, General
Motors Corporation. ISBN 978-1-60-534211-5. Fourth Edition. June 2010.
http://www.aiag.org/source/Orders/prodDetail.cfm?productDetail=MSA-4
2015 NCSL International Workshop & Symposium
57. ISO/TS 16949:2009. Quality management systems -- Particular Requirements for the
Application of ISO 9001:2008 for Automotive Production and Relevant Service Part
Organizations. International Organization for Standardization. 2009.
http://www.ts16949.com/a55aeb/ts16949.nsf/layoutB/Home+Page?OpenDocument
58. T. Nolan, P. Provost. Understanding Variation. ASQ Quality Progress. American Society for
Quality. Vol. 23 No. 5. May 1990. http://www.apiweb.org/UnderstandingVariation.pdf
59. J. Bucher. The Quality Calibration Handbook: Developing and Managing a Calibration
Program. American Society for Quality, Quality Press. ISBN-13: 978-0-87389-704-1. 2007.
http://asq.org/quality-press/display-item/?item=H1293
60. J. Bucher. The Metrology Handbook. American Society for Quality, Measurement Quality
Division. ASQ Quality Press. ISBN 0-87389-620-3. 2004.
http://asq.org/quality-press/display-item/?item=H1428
APPENDIX A
Monte Carlo Modeling of EOPR
For Various Adjustment Thresholds Under Drift and Random Variation
Figure A1. Example of first ten iterations of the Monte Carlo simulation
Conditions:
Drift set to 10 % of Specification (Random: σ = 50.038 % of Spec).
Adjustment Threshold set to 80 % of Spec.
To facilitate a step-by-step understanding of the model, the first 10 iterations are shown in Figure
A1, the first five of which are described in detail below.
Iteration #1.
Initial Conditions are set to 0 % observed error and 0 % attribute bias. Since the observed error
(0 %) is less than 100 % of spec, the UUT is declared In-Tolerance. Since the observed error (0
%) is also less than the adjustment threshold of 80 %, no discretionary adjustment is made. The
as-left attribute bias is 0 % of specification. The UUT is returned to the customer. During the
course of this calibration interval, a random error associated with a normal distribution manifests
(+81.2 % of Spec). Additionally, a systematic drift error also manifests (+10 % of spec). These
two errors are additive, resulting in a net error of +91.2 % of spec. At the end of the calibration
interval (End-of-Period), the error observed for the UUT is +91.2 % of spec. Note that
only 10 % of this error is due to systematic drift. Thus, the true bias error of the UUT is only
+10 %. The additional +81.2 % error arose from a random error. If a proper adjustment were to
be made at the end of this interval, only a -10 % adjustment should be made, correcting only the
systematic attribute bias due to the drift over this calibration interval.
Iteration #2. The UUT is received with an observed error of +91.2 % of spec. It is not known to
the calibration technician how much of the observed +91.2 % error is due to drift (bias) and how
much is due to random behavior. Since this error is less than 100 % of spec, the UUT is declared
In-Tolerance. However, since the observed error of +91.2 % is also greater than the adjustment
threshold of 80 %, a discretionary adjustment is made. The technician makes an adjustment of
-91.2 % in an attempt to correct for the observed error. A proper adjustment would have been
only -10 %, to compensate only for the cumulative systematic drift over the first interval. But this
is not possible since the only information available at the time of adjustment is the observed error
of +91.2 %. Thus, the adjustment overcompensates by -81.2 % and the UUT is returned to the
customer with an actual attribute bias of -81.2 %. During the course of this calibration interval, a
random error associated with a normal distribution manifests (+7.5 % of spec). Additionally, a
systematic drift error also manifests (+10 % of spec). These two errors are additive, resulting in a
net error of +17.5 % of spec. However, the previous calibration adjustment left the UUT with a -81.2 % systematic bias. Therefore, this pre-existing -81.2 % attribute bias is also added to the
+17.5 % error, resulting in a net observed error at the End-Of-Period of -63.7 %. Note that only a
-71.2 % error is due to systematic effects (-81.2 % from the overcompensated adjustment and
+10 % drift from the second interval). If a proper adjustment were to be made at the end of this
interval, only a +71.2 % adjustment should be made to correct exclusively for the systematic
attribute bias due to the overcompensated adjustment and the drift during this second interval.
Iteration #3. The UUT is received with an observed error of -63.7 % of spec. It is not known to
the calibration technician how much of the observed -63.7 % error is due to systematic effects
(bias) and how much is due to random behavior. Since the observed error is less than 100 % of
spec, the UUT is declared In-Tolerance. Moreover, the observed error of -63.7 % is less than the
adjustment threshold of 80 %; therefore, no discretionary adjustment is made. The UUT is
returned to the customer with an actual attribute bias of -71.2 %. During the course of this
calibration interval, a random error associated with a normal distribution manifests (+44.8 % of
spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are
additive, resulting in a net error of +54.8 % of spec. However, the previous calibration left the
UUT with a -71.2 % systematic bias. Therefore, this pre-existing -71.2 % attribute bias is added
to the +54.8 % error, resulting in a net observed error at the End-Of-Period of -16.5 %. Note that
only a -61.2 % error is due to systematic effects (-71.2 % and an additional +10 % drift from this
third interval). If a proper adjustment were to be made at the end of this interval, only a +61.2 %
adjustment should be made to correct exclusively for the systematic attribute bias.
Iteration #4. The UUT is received with an observed error of -16.5 % of spec. It is not known to
the calibration technician how much of the observed -16.5 % error is due to systematic effects
(bias) and how much is due to random behavior. Since the observed error is less than 100 % of
spec, the UUT is declared In-Tolerance. Moreover, since the observed error of -16.5 % is less
than the adjustment threshold of 80 %, a discretionary adjustment is not performed. The UUT is
returned to the customer with an actual attribute bias of -61.2 %. During the course of this
calibration interval, a random error associated with a normal distribution manifests (-39.7 % of
spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are
additive, resulting in a net error of -29.7 % of spec. However, the calibration adjustment left the
UUT with a -61.2 % systematic bias. Therefore, this pre-existing -61.2 % attribute bias is added
to the -29.7 % error, resulting in a net observed error at the End-Of-Period of -90.9 %. Note that
the attribute bias is only -51.2 % due to systematic effects (-61.2 % and an additional +10 %
drift from this fourth interval). If a proper adjustment were to be made at the end of this interval,
only a +51.2 % adjustment should be made to correct exclusively for the systematic attribute bias.
Iteration #5. The UUT is received with an observed error of -90.9 % of spec. It is not known to
the calibration technician how much of the observed -90.9 % error is due to systematic effects
(bias) and how much is due to random behavior. Since the observed error is less than 100 % of
spec, the UUT is declared In-Tolerance. However, since the observed error of -90.9 % exceeds
the adjustment threshold of 80 % in magnitude, a discretionary adjustment is made. The technician
makes an adjustment of +90.9 % in an attempt to correct for the observed error. A proper
adjustment would have only been +51.2 % to compensate only for the systematic attribute bias.
But this is not possible since the only information available at the time of adjustment is the
observed error of -90.9 %. Thus, the adjustment overcompensates by +39.7 % and the UUT is
returned to the customer with an actual attribute bias of +39.7 %. During the course of this
calibration interval, a random error associated with a normal distribution manifests (+115.8 % of
spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are
additive, resulting in a net error of +125.8 % of spec. However, the previous calibration left the
UUT with a +39.7 % systematic bias from the previous adjustment. Therefore, this pre-existing
+39.7 % attribute bias is added to the +125.8 % error, resulting in a net observed error at the
End-Of-Period of +165.5 %. Note that the attribute bias is only +49.7 % due to systematic
effects (+39.7 % bias from the previous adjustment, and another +10 % drift from this fifth
interval). If a proper adjustment were to be made at the end of this interval, only a -49.7 %
adjustment should be made to correct exclusively for the systematic attribute bias due to the
overcompensated adjustment and the drift over this fifth calibration interval. The UUT will
arrive in the calibration lab at the beginning of the 6th iteration with an actual attribute bias of +49.7 %, but with an observed error of +165.5 %.
The cyclic process described above is repeated for 100 000 iterations and the EOPR is computed.
The 100 000 iteration cycle is repeated 9 more times and the average of the ten EOPR values is
taken as the final estimate of one EOPR value for use in the 101 x 101 matrix. This entire
process is then repeated 10 200 times (1.02 x 10^10 total iterations) to complete the matrix, shown
in Figures 10A and 10B.
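For readers who wish to experiment with the Appendix A cycle outside of Excel/VBA, the following Python sketch reproduces the iteration logic described above: each interval adds a fixed drift and one normal random draw to the as-left bias, and the technician nulls the entire observed error whenever its magnitude meets the threshold. The function name is invented here; the defaults echo the Figure A1 conditions, and the seed is arbitrary.

```python
import random

def simulate_eopr(drift=0.10, sigma=0.50038, threshold=0.80,
                  tol=1.00, iterations=100_000, seed=42):
    """Estimate EOPR for one (drift, threshold) condition per the Appendix A
    cycle. All quantities are in units of the specification limit
    (1.00 = 100 % of spec)."""
    rng = random.Random(seed)
    bias = 0.0              # as-left attribute bias entering the interval
    in_tolerance = 0
    for _ in range(iterations):
        # End-of-period observed error: prior bias, one interval of drift,
        # and one normally distributed random error.
        observed = bias + drift + rng.gauss(0.0, sigma)
        if abs(observed) <= tol:
            in_tolerance += 1
        if abs(observed) >= threshold:
            # Technician adjusts out the whole observed error; the residual
            # as-left bias is minus the (unknowable) random component.
            bias = (bias + drift) - observed
        else:
            # No adjustment: the true bias (prior bias + drift) carries over.
            bias = bias + drift
    return in_tolerance / iterations

eopr = simulate_eopr()
```

Averaging ten such runs under different seeds, and sweeping drift and threshold over a grid, reproduces the structure of the 101 x 101 matrix described above.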
APPENDIX B
On the Use of Microsoft Excel for Monte Carlo Methods
The use of Microsoft Excel as a serious scientific platform for statistical analysis has many
detractors as well as a long history of critique by the statistical community [30-39]. However, it
remains one of the most widely used software tools in the world. As of 2010, an
estimated 750 million copies had been installed as part of the MS Office suite [40].
Excel may arguably be described by the principle of Maslow's hammer, often stated as, "If all
you have is a hammer, everything looks like a nail." Excel is undoubtedly utilized in many
situations where a more appropriate or efficient tool exists. Yet, this observation alone does not
preclude Excel's utility in a wide array of diverse applications. The flexibility and ubiquitous
nature of Excel may be more analogous to a Swiss Army Knife than a hammer. It may not be the
best tool for any job, but it can be an acceptable tool for many jobs, especially when
precautionary measures are taken to ensure acceptable performance. Given the Visual Basic for
Applications (VBA) programming environment in Excel, it can be a powerful option.
As with any software, it should be confirmed via objective evidence that Excel will provide
trustworthy, accurate results with an acceptable degree of confidence. This gives rise to
validation requirements in some critical applications to ensure that computations are being
performed correctly with an acceptable degree of accuracy. In this regard, Excel is no different
than any other software package. Its built-in functions, user-defined functions, logic, equations,
etc. should be validated to the extent necessary to satisfy applicable requirements. NIST provides
a Statistical Reference Database to aid in such evaluations [41].
Excel 2010
Many, but not all, of the historical criticisms regarding Excel's suitability for statistical analysis
have been addressed and largely rectified with Excel 2010 [38-39]. Mélard [38] has evaluated
Excel 2010's Pseudo Random Number Generator (PRNG), implemented as the RAND()
function. The RAND() function is designed to return values uniformly distributed over the range
of [0,1). Mélard has shown the RAND() function in Excel 2010 to pass most modern statistical
tests for randomness, specifically a modified version of the Crush test suite in the TestU01
library by L'Ecuyer and Simard [42]. TestU01 has essentially superseded older series of RNG
tests, e.g. the Diehard tests of Marsaglia [43], and offers a challenging battery of tests for any PRNG.
Had Mélard chosen to invoke the most rigorous test suite of the TestU01 library for testing the
RAND() function in Excel 2010, called BigCrush, it would have required a very large test file of
random numbers from Excel, roughly 3 TB in size; thus, BigCrush was not performed. As it was,
the smaller 412 GB Crush test-file (~2^35 numbers) took two weeks to generate and 36 hours of
CPU time to run the actual Crush tests. He concludes, "All tests are passed except Periods in
Strings with r = 15 and s = 15 for which the p-value is 8 x 10^-7." Mélard attributes these
anomalies to his specific approach in generating the test file to manage its size. Additionally,
Mélard references a semi-official indication that the Mersenne Twister algorithm known as
MT19937 has been implemented for the Excel 2010 RAND() function and is assumed to be
responsible for the improved performance compared with previous versions of Excel.
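The flavor of such empirical randomness testing can be illustrated, in vastly simplified form, with a single chi-square uniformity check. Python's random module uses the same MT19937 algorithm reportedly behind Excel 2010's RAND(); the sample size, bin count, and seed below are arbitrary illustrative choices, and TestU01's Crush batteries are enormously more demanding than this one statistic.

```python
import random

def chi_square_uniform(n=100_000, bins=20, seed=123):
    """Toy chi-square uniformity check on MT19937 output (far weaker than
    TestU01, but the same style of empirical test)."""
    rng = random.Random(seed)            # CPython's generator is MT19937
    counts = [0] * bins
    for _ in range(n):
        counts[int(rng.random() * bins)] += 1
    expected = n / bins
    return sum((c - expected) ** 2 / expected for c in counts)

stat = chi_square_uniform()
```

With bins - 1 = 19 degrees of freedom the statistic should land near 19 for a healthy generator; values repeatedly far out in the tail of the chi-square distribution would be cause for suspicion.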
The Mersenne Twister (MT19937) is a modern pseudo random number generator
published in 1998 by Matsumoto and Nishimura [44]. It is now available in many mathematical
programming packages, e.g. MATLAB, Maple, R, GAUSS, SAS, SPSS, Ruby, Python, Julia,
Visual C++, etc., and presumably in Excel 2010. However, MT19937 has been shown to fail two
tests for linear complexity (r = 1 and r = 29, where p < 10^-15) in the extensive BigCrush suite of
tests using version 1.0 of TestU01 [38, 42]. Conflictingly, other authors report that MT19937
passes all tests in BigCrush using version 1.1 [38, 45] and version 0.6.0 (see footnote 4) [47] of TestU01.
However, L'Ecuyer and Simard (the authors of TestU01) have confirmed that MT19937 does
indeed fail recent TestU01 tests for linear complexity and that it is well understood why this
occurs [53]. L'Ecuyer et al. have published that MT19937 "...successfully passed all the
statistical tests included in BigCrush of TestU01, except those that look for linear
dependencies in a long sequence of bits, such as the linear complexity tests... This is in fact a
limitation of all F2-linear generators, including the Mersenne Twister... Because of their linear
nature, the sequences produced by these generators just cannot have the linear complexity of a
truly random sequence. This is definitely unacceptable in cryptology... but is quite acceptable
for the vast majority of simulation applications" [50].
Good PRNGs are essential in order for Monte Carlo methods to yield accurate results. The same
is true for the probability distributions used in the simulations, e.g. the normal probability density
function and its inverse. In the aforementioned paper by Mélard [38], the accuracy of the Excel
2010 NORM.INV function, along with many other probability distributions, was also tested with
positive results. Improvements over Excel 2003 and 2007 were noted, and its accuracy was on par
with other statistical applications. Mélard states, "On the basis of these results, Microsoft Excel
2010 appears as good as OpenOffice.org Calc 3.3." He continues, "To conclude, most of the
problems of Excel raised by Yalta (2008) were corrected in the 2010 version." Regarding the
NORM.S.DIST and NORM.DIST functions of Excel 2010, Knüsel [39] additionally notes, "No
errors were found with these two functions" (see footnote 5), and states, "Most of the errors in Microsoft Excel 97
and Excel 2003 pointed out in my previous papers have been eliminated in Excel 2010."
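A simple round-trip check of the kind applied to the inverse-normal and normal CDF functions can be sketched in a few lines. Here Python's statistics.NormalDist stands in for the spreadsheet's NORM.S.INV and NORM.S.DIST; the helper name and the probe probabilities are arbitrary choices for illustration.

```python
from statistics import NormalDist

def max_roundtrip_error(probs):
    """Largest |p - Phi(Phi^-1(p))| over the given probabilities, where Phi
    is the standard normal CDF. The same style of check applies to
    NORM.S.DIST(NORM.S.INV(p), TRUE) in a spreadsheet."""
    nd = NormalDist()   # standard normal, mu = 0, sigma = 1
    return max(abs(p - nd.cdf(nd.inv_cdf(p))) for p in probs)

err = max_roundtrip_error([1e-10, 1e-6, 0.025, 0.5, 0.975, 1 - 1e-6])
```

For an accurate implementation the round-trip error remains far below any practically significant level, including well out into the tails.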
The preceding evidence suggests that two critical aspects of Monte Carlo simulation are
satisfied by Excel 2010, i.e. accurate statistical probability density functions and a robust random
number generator. As such, Excel 2010 may be a viable tool for investigations such as those
presented in this paper. In addition, looping functions in VBA can be used to readily process
Monte Carlo simulations in Excel. When doing so, it is sometimes helpful to change the formula
calculation option for workbooks from its default setting of automatic to manual, and then to
embed the VBA code to perform calculations (e.g. Calculate) into the loop itself. Also, turning
off screen updating (Application.ScreenUpdating = False) can greatly reduce the time required
to perform long sequences of Monte Carlo iterations in Excel.
Footnote 4: Wichmann & Hill in 2006 [47] report MT19937 passes all BigCrush tests in Version 6.0 of TestU01 (dated Jan 14, 2005). However, Simard has confirmed this must have actually been version 0.6.0 (pre-official-release), where the Linear Complexity tests used all of the first 30 bits; MT19937 would indeed pass this [53]. Later official versions of TestU01 have two linear complexity tests that use the 1st and 30th bit of each random number, which MT19937 fails [42, 50, 53] using version 1.0 and later. It is unknown why passing results for all BigCrush tests in Ver 1.1 of TestU01 were reported by McCullough [45]. The current version of TestU01 is 1.2.3, dated 18 August 2009 [42].
Footnote 5: Knüsel does report [negligible] errors in Excel 2010's NORM.S.INV and NORM.INV functions at extremely small probabilities (p-values) < 2.2251 x 10^-308.
In addition to being deterministic, PRNGs do not have infinite periods. At some point,
the stream of output numbers will begin to repeat itself; a
replicate pattern will eventually emerge. Short periods are undesirable. The original 3-cycle
Wichmann-Hill PRNG (1982) has a period of ~10^13 [46, 47]. This is relatively small by today's
standards, and this older PRNG also fails some BigCrush tests in TestU01. A revised/enhanced
4-cycle Wichmann-Hill PRNG (2006) has a period of ~10^36 [47], adequate for nearly any
application imaginable. Moreover, the enhanced 4-cycle Wichmann-Hill PRNG has been shown
to pass all BigCrush tests in version 0.6.0 (see footnote 6) of TestU01 [47, 47a]. It has many other desirable
properties as well and requires only 26 lines of code to implement in C; it could also be
implemented in Excel using VBA. It is regarded as a highly robust PRNG and is referenced in
Annex C of JCGM 101:2008 (see footnote 7) (GUM Supplement 1) for computing measurement uncertainty via
Monte Carlo methods [51]. Although the aforementioned MT19937 algorithm fails two
BigCrush tests in more recent versions of TestU01, it has an extremely long period of ~2^19937
[44], or ~10^6001. To fully appreciate the length of these periods, consideration of the following
large numbers is insightful:
The age of the universe is ~4.4 x 10^17 seconds (13.8 billion years).
The fastest supercomputers approach ~3 x 10^16 floating point operations per second.
The number of atoms in the observable universe is ~10^80.
The number of Planck volumes in the observable universe is ~10^185.
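For reference, the original 3-cycle Wichmann-Hill generator (AS 183, 1982) discussed above is compact enough to reproduce in full. The Python transcription below is a sketch of the published algorithm (class name and seed values are arbitrary); the same few lines port directly to a VBA user-defined function.

```python
class WichmannHill1982:
    """The 3-cycle Wichmann-Hill generator (AS 183, 1982): three small linear
    congruential cycles combined modulo 1. Its period of ~10^13 is why it is
    considered short by modern standards."""

    def __init__(self, s1=1, s2=1, s3=1):
        # Seeds should be positive integers smaller than the three moduli.
        self.s1, self.s2, self.s3 = s1, s2, s3

    def random(self):
        self.s1 = (171 * self.s1) % 30269
        self.s2 = (172 * self.s2) % 30307
        self.s3 = (170 * self.s3) % 30323
        return (self.s1 / 30269 + self.s2 / 30307 + self.s3 / 30323) % 1.0

rng = WichmannHill1982(s1=123, s2=456, s3=789)
sample = [rng.random() for _ in range(5)]   # each value lies in [0, 1)
```

The 2006 4-cycle enhancement adds a fourth congruential cycle with larger moduli, which is what extends the period to ~10^36.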
A paper in 2006 by Steele and Douglas (see footnote 8) [52, 52a] also provides a wealth of practical information
and useful insights for performing Monte Carlo simulations in Excel. While focused on
computing measurement uncertainties in Excel, the paper illustrates the usefulness of the VBA
programming environment for implementing alternative (custom) pseudo random number
generators. The authors provide VBA code for the 1982 Wichmann-Hill PRNG and include
step-by-step instructions for writing custom user-defined VBA functions. Advanced users are
referred to external Dynamic Link Libraries (DLLs) to facilitate faster execution of compiled
code in C (such as PRNGs) within Excel. Also identified in the paper are the limitations of
the 1982 Wichmann-Hill generator along with reference to a PRNG called RANLUX which
offers higher dynamic range as well as other beneficial characteristics (the authors offer to
provide VBA code for RANLUX as well as additional helpful resources). It should be noted that
the Steele and Douglas paper was written prior to the publication of the enhanced Wichmann-Hill
4-cycle PRNG (2006), prior to the final release of JCGM 101:2008 (GUM Supplement One), and
prior to the advent of Excel 2010. Nevertheless, the paper remains an excellent resource for the
researcher wishing to investigate Monte Carlo methods in Excel.
Footnote 6: Reported by Wichmann & Hill [47] as version 6.0 of TestU01. See preceding footnote 3 regarding version 0.6.0.
Footnote 7: JCGM 101:2008 does not exclusively recommend any particular PRNG over others. It states, "Generators other than those given in this annex can be used. Their statistical quality should be tested before use." [51]
Footnote 8: This paper was also presented by Dr. Alan Steele on August 9th at the 2005 NCSL International Workshop and Symposium in Washington D.C., for which it won Best Paper award in Theoretical Metrology [52a].