
Instrument Adjustment Policies

Speaker/Author: Paul Reese


Baxter Healthcare Corporation
25212 West Illinois Route 120
Mail Stop: WG2-2S
Round Lake, IL 60073
Phone: (224) 270-4547 Fax: (224) 270-2491
E-mail: paul_reese@baxter.com
Abstract
Instrument adjustment policies play a key role in the reliability of calibrated instruments to
maintain their accuracy over a specified time interval. Periodic review and adjustment of
assigned calibration intervals is required by national standard ANSI/NCSL Z540.3 and is
employed to manage the End of Period Reliability (EOPR) to acceptable levels. Instrument
adjustment policies may also be implemented with various guardband strategies to manage false
accept risk. However, policies and guidance addressing the routine adjustment of in-tolerance
instruments are not so well established. National and international calibration standards
ANSI/NCSL Z540.3 and ISO/IEC-17025 do not mandate any particular adjustment policy with
regard to in-tolerance equipment. Evidence has been previously presented where routine
adjustment of in-tolerance items may even degrade performance. Yet, this important part of the
overall calibration process is often left to the discretion of the calibrating technician based on
heuristic assessment. Astute adjustment decisions require knowledge of the random vs.
systematic nature of instrument error. Instruments dominated by systematic effects, such as drift,
benefit from adjustment, while those displaying more random behavior may not. Monte Carlo
methods are used here to investigate the effect of various adjustment thresholds on in-tolerance
instruments.
1. Background
Instrument adjustment policies during calibration vary among different organizations. Such
policies can generally be classified into one of three categories:
1) Adjust always
2) Adjust only if Out-Of-Tolerance (OOT)
3) Adjust with discretion when In-Tolerance and always when OOT
While the first two policies are essentially self-explanatory, the third category deserves further
attention. Herein, a discretionary adjustment is one in which the calibration technician (or
software) makes a decision to adjust an instrument, which is observed to be in-tolerance, based
on consideration of additional factors. Discretionary adjustment may sometimes be performed in
conjunction with guardbanding strategies to mitigate false-accept-risk. Guardbanding techniques
often require discretionary adjustments to be made where low Test Uncertainty Ratio (TUR)
and/or End Of Period Reliability (EOPR) are encountered. Significant literature exists on this subject
[23-23].


However, this paper endeavors to provide an investigation into discretionary adjustments of in-tolerance instruments which are made, not to mitigate false accept risk, but as a preemptive
measure in an attempt to reduce the potential for future out-of-tolerance (OOT) conditions. A
reduction in OOT probability can translate into improved EOPR reliability. Such adjustments are
often made on the bench at the discretion of the calibration technician when the observed error
is deemed too close to the tolerance limits. Organizations may sometimes have a blanket policy
or threshold in place that defines, in a broad, general sense, what "too close" is. This adjustment
threshold may be 70 % of specification, 80 % of specification, or any other arbitrary value. The
intent of this policy may be to improve accuracy and mitigate future OOT conditions, improving
EOPR. The objective of this paper is to investigate whether such adjustments can, in fact,
provide an increase in accuracy and a reduction in OOT probability (increased EOPR) and, if so,
by how much and under what conditions. The possibility of calibration adjustments unwittingly
degrading performance is also investigated.
There are no national or international standards which dictate or require adjustment during
calibration, unless an instrument is found OOT or the observed error fails to meet guardband
criteria. ANSI/NCSL Z540.3-2006 and ISO/IEC-17025:2005 do not mandate discretionary
adjustment of in-tolerance items [1 - 3]. The International Vocabulary of Metrology (VIM)
clearly defines calibration, verification, and adjustment as separate actions [4]. Adjustment is not
a de facto aspect of calibration. As defined by the VIM:
Calibration: Operation that, under specified conditions, in a first step, establishes a relation
between the quantity values with measurement uncertainties provided by measurement standards
and corresponding indications with associated measurement uncertainties and, in a second step,
uses this information to establish a relation for obtaining a measurement result from an
indication. NOTE 2: Calibration should not be confused with adjustment of a measuring
system, often mistakenly called "self-calibration", nor with verification of calibration.
Adjustment of a measuring system: Set of operations carried out on a measuring system so that
it provides prescribed indications corresponding to given values of a quantity to be measured.
NOTE 2: Adjustment of a measuring system should not be confused with calibration, which is a
prerequisite for adjustment.
Verification: Provision of objective evidence that a given item fulfils specified requirements.
EXAMPLE 2: Confirmation that performance properties or legal requirements of a measuring
system are achieved. NOTE 3: The specified requirements may be, e.g., that a manufacturer's
specifications are met. NOTE 5: Verification should not be confused with calibration.
Despite these established definitions, there have been recent accounts where entities regulated by
the Food and Drug Administration (FDA) have received Form 483 Inspectional Observations
and Warning Letters arising from the failure to always adjust in-tolerance instruments (i.e. all
instruments) during calibration [5]. These incidents may be attributable to a nebulous distinction
between the definitions of calibration, verification, and adjustment. References to similar events
in regulated industries have also been published [6 - 8] where calibration requirements have
been inferred to mandate adjustment during calibration.


Consistent with the VIM definitions, a calibration, where a pass/fail conformance decision is
made, also satisfies the definition of a verification. However, the converse is not true; not all
verifications are calibrations. This distinction is important because, for example, not all
calibrations result in a pass/fail conformance decision being issued. Such is the case for most
calibrations performed by National Metrology Institutes (NMI) and some reference standards
laboratories where calibrations are routinely performed and no pass/fail conformance decision is
made. The definition of calibration requires no such conformance decision be rendered. In these
cases, calibration consists of the measurement data reported along with the measurement
uncertainty. Such operations still adhere to the VIM definition of calibration, but they are not
verifications, since no statement of conformance to metrological specifications is given.
However, calibrations which do result in a statement of conformance (i.e. pass/fail) with respect
to an established metrological specification are also verifications. In such scenarios, the
definitions of calibration and verification are both applicable. However, the absence of
adjustment of a measuring system during calibration in no way negates or disqualifies the
proper usage of the term calibration. Many instruments do not lend themselves to adjustment
and are not designed to be physically or electronically adjusted to periodically nominalize their
performance for the purpose of reducing measurement errors; yet, such instruments are still quite
capable of being calibrated. The distinction is readily apparent as indicated by ANSI/NCSL
Z540.3-2006 section 5.3a and 5.3b shown below [1, 2].
5.3 Calibration of Measuring and Test Equipment
a) Where calibrations provide for reporting measured values, the measurement uncertainty
shall be acceptable to the customer and shall be documented.
b) Where calibrations provide for verification that measurement quantities are within specified
tolerances, the probability that incorrect acceptance decisions (false accept) will result from
calibration tests shall not exceed 2 % and shall be documented. Where it is not practicable
to estimate this probability, the test uncertainty ratio shall be equal to or greater than 4:1.
2. NCSLI RP-1: Establishment and Adjustment of Calibration Intervals
As stated, discretionary adjustments of in-tolerance instruments are often left to the judgment of
the calibration technician, or governed by organizational policy. When deferred to the discretion
of the technician, such adjustments are optimally based on professional evaluation by qualified
personnel with experience and training in the metrological disciplines for which they are
responsible. Heuristic assessment of instrument adjustment requirements, combined with
empirical data and epistemological knowledge gathered over multiple calibration operations may
provide a somewhat intuitive qualitative notion of when adjustment might be beneficial.
However, there is little formal quantitative guidance on this subject. The most authoritative
reference on such discretionary adjustments is found in NCSLI Recommended Practice RP-1,
Establishment and Adjustment of Calibration Intervals, henceforth referred to as NCSLI RP-1
[9]. Appendix G of NCSLI RP-1 refers to three adjustment policies as
1) Renew-always
2) Renew-if-failed
3) Renew-as-needed

NCSLI RP-1 employs the term "renew" to convey an adjustment action. Herein, the renew-as-needed policy is synonymous with discretionary adjustment. As stated in RP-1 [9],
At present, no inexpensive systematic tools exist for deciding on the optimal renewal policy for
a given MTE. While it can be argued that one policy over another should be implemented on an
organizational level, there is a paucity of rigorously demonstrable tests that lead to a clear-cut
decision as to what that policy should be. The implementation of reliability models, such as the
drift model, that yield information on the relative contributions of random and systematic effects,
seems to be a step in the right direction.
The objective of this paper is to provide some additional discourse regarding the random and
systematic drift effects associated with some instruments and to provide insight as to the impact
of these effects on EOPR reliability under various discretionary adjustment thresholds. As
provided in NCSL RP-1 [9], discretionary adjustments may be influenced by one or more of the
following criteria, where this paper focuses specifically on questions #4, #5, #6, & #7:
1) Does parameter adjustment disturb the equilibrium of a parameter, thereby hastening the
occurrence of an out-of-tolerance condition?
2) Do parameter adjustments stress functioning components, thereby shortening the life of
the MTE?
3) During calibration, the mechanism is established to optimize or "center-spec"
parameters. The technician is there, the equipment is set up, the references are in place.
If it is desired to have parameters performing at their nominal values, is this not the best
time to adjust?
4) By placing parameter values as far from the tolerance limits as possible, does adjustment
to nominal extend the time required for re-calibration?
5) Do random effects dominate parameter value changes to the extent that adjustment is
merely a futile attempt to control random fluctuations?
6) Do systematic effects dominate parameter value changes to the extent that adjustment is
beneficial?
7) Is parameter drift information available that would lead us to believe that not adjusting
to nominal would, in certain instances, actually extend the time required for recalibration?
8) Is parameter adjustment prohibitively expensive?
9) If adjustment to nominal is not done at every calibration, are equipment users being
short-changed?
10) What renewal practice is likely to be followed by calibrating personnel, irrespective of
policy?
11) Which renewal policy is most consistent with a cost-effective interval analysis
methodology?

Weiss [10] addressed the issue of calibration adjustment in some detail in 1991 in a paper
entitled "Does Calibration Adjustment Optimize Measurement Integrity?" Weiss showed that in
the presence of purely random errors associated with a normal probability density function,
where no statistical difference in the mean value of the distributions exists from one calibration
to the next, that calibration adjustment can degrade instrument performance. Weiss and several
other authors [10-14, 56-60] have drawn upon the popular Deming funnel experiment to
illustrate how "tampering" with or adjusting a calibrated system in a state of statistical control
can introduce additional unwanted variation into a process rather than reduce existing variation¹.
As Weiss demonstrates, if the process exhibits purely random error represented by a normal
probability density function, the effect of this tampering is to increase the variance (σ²) by a factor
of 2. This is equivalent to increasing the standard deviation to a value of √2·σ, or ~1.414σ. If
the specification limits were originally set to achieve 95 % confidence (±1.96σ), then this
increased variation from tampering results in an in-tolerance probability (EOPR) of only 83.4 %.
This value becomes important for the interpretation of the results later in this paper in Section 6.
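The magnitude of this effect is straightforward to reproduce numerically. The short Monte Carlo sketch below (an illustration assuming purely random, normally distributed errors and an adjust-always policy equivalent to Deming's rule #2; it is not code from Weiss's paper) shows the standard deviation growing by a factor of √2 and the in-tolerance probability falling from ~95 % to ~83.4 %:

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                      # random (common-cause) standard deviation
spec = 1.96 * sigma              # tolerance limit set for ~95 % containment
n = 200_000                      # simulated calibration cycles

e = rng.normal(0.0, sigma, n)    # error that would be observed with no adjustment
bias = np.concatenate(([0.0], -e[:-1]))   # rule #2: each adjustment leaves a bias of -previous error
adjusted = bias + e              # observed error under an adjust-always policy

print(np.std(adjusted) / sigma)           # ~1.414, i.e. sqrt(2) growth in standard deviation
print(np.mean(np.abs(e) <= spec))         # ~0.95 in-tolerance probability with no adjustment
print(np.mean(np.abs(adjusted) <= spec))  # ~0.834 in-tolerance probability under adjust-always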
Shah [11] likewise comments in 2007, stating, "Calibration has nothing to do with adjustment.
When a measurement system is adjusted to measure the nominal value whether it is within
tolerance or not... Is this advisable or is it causing more harm than good?... Some adjustments
are justified. Others are not. A calibration technician has to make an instant decision on a
measurement taken... Making a bad decision can lead to quality problems... It is shown that a
stable process with its inherent natural (random) variation should be left on its own."
Abell [13] also touched on this issue in 2003 noting that, one might be inclined to readjust
points to the center of the specification. The temptation to optimize all points by adjusting to
the exact center between the specifications causes two problems. The first is that it might not be
possible to adjust the instrument on a re-calibration to an optimal center value, even with an
expensive repair. Second, a stable instrument that is unlikely to drift will be made worse by
attempts to optimize its performance.
Payne [14] in 2005 makes similar comments. There are two reasons adjustment is
not part of the formal definition of calibration: (1) The historical calibration data on
an instrument can be useful when describing the normal variation of the instrument or a
population of substantially identical instruments... (2) a single measurement from that process
is a random sample from the probability density function that describes it. Without other
knowledge, there is no way to know if the sample is within the normal variation limits. The
history gives us that information. If the measurement is within the normal variation and not
outside the specification limits, there is no reason to adjust it. In fact, making an adjustment
could just as likely make it worse as it could make it better. W. Edwards Deming discusses the
problem of overadjustment in chapter 11 of Out of the Crisis.
¹ In the Deming experiment, a stationary funnel is fixed a short distance directly above the center of a target and marbles are
dropped through the funnel onto the target; the resting spot of each marble is marked. Repeated cycles of this will display resting
spots in a random pattern with a natural fixed common-cause variation (σ) around the target's center, following so-called rule
#1 of never adjusting the position of the funnel. Alternatively, if the operator follows rule #2 and futilely attempts to adjust the
position of the funnel after each drop (equal and opposite to the last observed error), the variation of the resting spots increases.


ISO/TS 16949:2009 [57], which supersedes QS9000 quality management requirements for the
automotive industry, also refers to the phenomenon of over-adjustment in Section 8.1.2 by
requiring, Basic statistical concepts, such as variation, control (stability), process capability
and over-adjustment, shall be understood throughout the organization.
The MSA Reference Manual [56] also describes over-adjustment, stating:
"...the decision to adjust a manufacturing process is now commonly based on measurement
data. The data, or some statistic calculated from them, are compared with statistical control
limits for the process, and if the comparison indicates that the process is out of statistical
control, then an adjustment of some kind is made. Otherwise, the process is allowed to run
without adjustment... [However] Often manufacturing operations use a single part at the
beginning of the day to verify that the process is targeted. If the part measured is off target, the
process is then adjusted. Later, in some cases another part is measured and again the process
may be adjusted. Dr. Deming referred to this type of measurement and decision-making as
tampering... Over-adjustment of the process has added variation and will continue to do so...
The measurement error just compounds the problem... Other examples of the funnel experiment
are: (1) Recalibration of gages based on arbitrary limits, i.e., limits not reflecting the
measurement system's variability (Rule 3). (2) Autocompensation adjusts the process based on
the last part produced (Rule 2)."
Nolan and Provost [58] in 1990 also provide the following, Decisions are made to adjust
equipment, to calibrate a measurement device, etc. All these decisions must consider the
variation in the appropriate measurements or quality characteristics of the process... The aim of
the adjustment is to bring the quality characteristic closer to the target in the future. ...there are
circumstances in which the adjustments will improve the performance of the process, and there
are circumstances in which the adjustment will result in worse performance than if no
adjustment is made... Continual adjustment of a stable process, that is, one whose output is
dominated by common causes, will increase variation and usually make the performance of the
process worse.
Bucher, in The Quality Calibration Handbook [59], and The Metrology Handbook [60], states
With regard to adjusting IM&TE, there are several schools of thought on the issue. On one end
of the spectrum, some (particularly government regulatory agencies) require that an instrument
be adjusted at every calibration, whether or not it is actually required. At the other end of the
spectrum, some hold that any adjustment is tampering with the natural system (from Deming)
and what should be done is simply to record the values and make corrections to measurements.
An intermediate position is to adjust the instrument only if (a) the measurement is outside the
specification limits, (b) the measurement is inside but near the specification limits, where near
is defined by the uncertainty of the calibration standards, or (c) a documented history of the
values of the measured parameter shows that the measurement trend is likely to take it out of
specification before the next calibration due date.
The Weiss and Deming models [10] assume purely random variation, for which adjustment is not
only futile, but actually detrimental. In such cases, adjustment or tampering results in an increase
in the standard deviation (σ) of the process by a factor of 1.414, or about 41 %. However, if the
behavior is not purely random, the results can differ. As noted in NCSL RP-1 Appendix G [9],


"However, if a systematic mean value change mechanism, such as monotonic drift, is introduced
into the model, the result can be quite different. For discussion purposes, modifications of the
model that provide for systematic change mechanisms will be referred to as Weiss-Castrup
models (unpublished)... By experimenting with different combinations of values for drift rate
and extent of attribute fluctuations in a Weiss-Castrup model, it becomes apparent that the
decision to adjust or not adjust depends on whether changes in attribute values are
predominately random or systematic."
Appendix D of NCSL RP-1 describes ten Measurement Reliability Models with #9 being
systematic attribute drift superimposed over random fluctuations (drift model) [9]:
1) Constant out-of-tolerance rate (exponential model).
2) Constant-operating-period out-of-tolerance rate with a superimposed burn-in or wear out
period (Weibull model).
3) System out-of-tolerances resulting from the failure of one or more components, each
characterized by a constant failure rate (mixed exponential model).
4) Out-of-tolerances due to random fluctuations in the MTE attribute (random walk model).
5) Out-of-tolerances due to random attribute fluctuations confined to a restricted domain
around the nominal or design value of the attribute (restricted random-walk model).
6) Out-of-tolerances resulting from an accumulation of stresses occurring at a constant
average rate (modified gamma model).
7) Monotonically increasing or decreasing out-of-tolerance rate (mortality drift model).
8) Out-of-tolerances occurring after a specific interval (warranty model).
9) Systematic attribute drift superimposed over random fluctuations (drift model).
10) Out-of-tolerances occurring on a logarithmic time scale (lognormal model).
This paper investigates behavioral characteristics of instruments that are described by the #9
reliability model above, systematic attribute drift superimposed over random fluctuations (drift
model).
Background information provided in Appendix D of NCSLI RP-1 is highly enlightening with
respect to the Weiss-Castrup Drift model and the decision to adjust or not. Additional
information is also provided by Castrup [54].
A section from Appendix D of NCSLI RP-1 is provided here to facilitate an understanding of the
relationship between systematic and random components of behavior and their influence on both
interval and instrument adjustment decisions, where Φ denotes the normal distribution function,
with probability density

f(x) = (1 / (σ√(2π))) · exp[−(x − μ)² / (2σ²)]

Where: x = random variable, σ = standard deviation, μ = mean

Appendix D of NCSL RP-1 [9]


Drift Model:

R(t) = Φ(a₁ + a₃t) + Φ(a₂ − a₃t) − 1

∂R/∂a₁ = (1/√(2π)) · exp[−(a₁ + a₃t)²/2]

∂R/∂a₂ = (1/√(2π)) · exp[−(a₂ − a₃t)²/2]

∂R/∂a₃ = (t/√(2π)) · {exp[−(a₁ + a₃t)²/2] − exp[−(a₂ − a₃t)²/2]}

Figure D-11. Drift Measurement Reliability Model (a₁ = 2.5, a₂ = 0.5, and a₃ = 0.5)


Renewal Policy and the Drift Model:
In the drift model, if the conditions |a₃t| ≪ |a₁| and |a₃t| ≪ |a₂| hold, then the measurement reliability
of the attribute of interest is not sensitive to time elapsed since calibration. This is equivalent to saying
that, if the coefficient a₃ is small enough, the attribute can essentially be left alone, i.e., not periodically
adjusted.

Interestingly, the coefficient a₃ is the rate of attribute value drift divided by the attribute value standard
deviation: a₃ = b/σ, where b = attribute drift rate, and σ = attribute standard deviation. From this
expression, we see that the coefficient a₃ is the ratio of the systematic and random components of the
mechanism by which attribute values vary with time. If the systematic component dominates, then a₃ will
be large. If, on the other hand, the random component dominates, then a₃ will be small. Putting this
observation together with the foregoing remarks concerning attribute adjustment leads to the following
axiom:

If random fluctuation is the dominating mechanism for attribute value changes over time, then the
benefit of periodic adjustment is minimal.

As a corollary, it might also be stated that

If drift or other systematic change is the dominating mechanism for attribute value changes over
time, then the benefit of periodic adjustment is high.

Obviously, use of the drift model can assist in determining which adjustment practice to employ for a
given attribute. By fitting the drift model to an observed out-of-tolerance time series and evaluating the
coefficient a₃, it can be determined whether the dominant mechanism for attribute value change is
systematic or random. If a₃ is small, then random changes dominate and a renew-if-failed only practice
should be considered. If a₃ is large, then a renew-always practice should perhaps be implemented.

Copyright 2010 NCSLI. All Rights Reserved. NCSLI Information Manual. Reprinted here under the provisions of the
Permission to Reproduce clause of NCSLI RP-1.
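To see numerically how the coefficient a₃ governs the time sensitivity of measurement reliability, the sketch below evaluates the drift model reliability function in the reconstructed form given above (this is an illustrative reading of the RP-1 model, not code from RP-1, and the parameter values are arbitrary):

from scipy.stats import norm

def drift_model_reliability(t, a1, a2, a3):
    # In-tolerance probability at time t for a normally fluctuating attribute with linear
    # drift, using the form R(t) = Phi(a1 + a3*t) + Phi(a2 - a3*t) - 1 reconstructed above.
    return norm.cdf(a1 + a3 * t) + norm.cdf(a2 - a3 * t) - 1.0

# Small a3 (random fluctuation dominates): reliability is nearly insensitive to elapsed time.
print([round(drift_model_reliability(t, 2.0, 2.0, 0.05), 3) for t in (0, 1, 2, 4)])
# Large a3 (drift dominates): reliability decays quickly, so periodic adjustment is beneficial.
print([round(drift_model_reliability(t, 2.0, 2.0, 1.0), 3) for t in (0, 1, 2, 4)])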


The Weiss-Castrup Drift model described in NCSL RP-1 was primarily intended for the
determination, adjustment, and optimization of calibration intervals in association with Methods
S2 & S3, also called the Binomial Method and the Renewal Time Method, respectively [9].
The Weiss-Castrup drift model is investigated here with a focus on instrument adjustment
thresholds, rather than interval adjustment actions. That is, for a given fixed calibration interval,
how do various discretionary adjustment thresholds (0 % to 100 % of specification), in the
presence of both drift and random variation, affect EOPR reliability? Clearly, if the behavior is
purely random, as in the Weiss and Deming models, an adjust-always policy (0 % adjust
threshold) is detrimental to the instrument performance resulting in decreased EOPR.
However, if the behavior has any element of monotonic drift, as in the Weiss-Castrup Drift
model, an adjustment will be necessary at some point to prevent an eventual OOT condition
resulting from a true attribute bias due to drift. The difficulty manifests during calibration when
attempting to discriminate between attribute bias and a random error. Thus, investigating optimal
adjustment thresholds to maximize EOPR in the presence of random and systematic errors seems
a worthy endeavor. It is also prudent to consider that, even if an optimum adjustment threshold
is determined, there may be other administrative and managerial factors as described in NCSL
RP-1 Appendix G [9] that should be considered when formulating adjustment policies.
The policy of some U.S. Department of Defense military programs and third party OEM
accredited calibration laboratories has been to not routinely, by default, adjust most equipment
unless found out-of-tolerance. For example, "The U.S. Navy has the policy of not adjusting test
equipment that are in tolerance." [15].
However, even under some programs which typically employ an adjust-only-if-OOT policy,
discretionary adjustments are still performed for select equipment types. For example, it is not
uncommon to always assign new calibration factors to microwave power sensors, or sensitivity
values to accelerometers, or coefficients to temperature sensors (e.g., RTDs, PRTs, etc.),
regardless of the as-found condition of the device. In these cases, rather than judge in-tolerance
or out-of-tolerance based on published specifications, these decisions are often rendered based
on the previously assigned uncertainty, applicable to the assigned value. In these applications,
uncertainties must include a reproducibility component in the uncertainty budget that is
applicable over the calibration interval for stated conditions. Such estimates can be attained by
evaluation of historical performance.
3. Empirical Examples: Systematic Drift Superimposed Over Random Fluctuations
The idea that attribute bias can grow or drift over time is ubiquitous; indeed much of the history
of metrology and the impetus for calibration are predicated on this possibility. Examples of such
behavior are often encountered. The distinction between attribute bias arising from drift (or
otherwise), and a random error, is sometimes only discernable from the analysis of historical
data. Monotonic drift can be estimated using linear regression models. Such is the case with 10 V
DC zener voltage references. Calibration of these devices must be performed via comparison to
other characterized zeners, standard cells, or, in the most accurate cases, Josephson voltage
measurement systems. Due to the inherently low drift characteristics of commercial zener
references, it would not be possible to adequately detect or resolve drift without a measurement
system exhibiting high resolution, low noise, and zero (or well-characterized/compensated) drift.

The data represented in Figure 1 was acquired with a Josephson voltage measurement system.
The noise or variation observed in the data is primarily due to the zener under test and not to
the measurement standard, while all of the observed drift is attributable to the zener and none to
the measurement standard [16, 17]. It may be noted that the fluctuations about the predicted drift
line are not purely random in nature; they are pseudo-random.

Figure 1. Zener drift and pseudo-random variation


Short term variation is also significantly lower than long term variation about the predicted line
(better repeatability than reproducibility). Long term variation is attributable to uncorrected
seasonal pressure and/or humidity dependencies, 1/f noise, white noise, etc. In the presence of
this long-term variation, significant calibration history is necessary in order to confidently
characterize the drift of such instruments. Moreover, some of the apparently random common-cause variation might indeed be correctable. One example is the application of pressure
coefficients to correct for ambient changes in barometric pressure. In many applications, with
enough effort and the availability of measurement systems with ultra-high resolution and
accuracy, some apparently common-cause variation can be revealed as special-cause. All
metrology systems, including the UUT, will ultimately contain a finite amount of common-cause variation or uncertainty, even after all corrections have been applied.
The R² value (or coefficient of determination) from the regression is a figure of merit for the
linear drift model and other models, as it compares the amount of variation around the prediction
to the variation resulting from a constant (no-drift) model. R² is an indicator of the amount of
variation that is explained by linear monotonic drift. Normality tests and visual analysis of the
regression residuals are also beneficial and can reveal secondary non-linear effects.
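As a simple illustration of this figure of merit, the sketch below (using hypothetical error data, not the data of Figure 1) fits a linear drift line to a calibration history and computes R² as the fraction of variation explained by drift relative to a constant, no-drift model:

import numpy as np

years = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])            # time since first calibration
error_ppm = np.array([0.2, -0.6, -1.1, -1.9, -2.2, -3.0])   # observed errors (hypothetical)

slope, intercept = np.polyfit(years, error_ppm, 1)           # estimated drift rate and offset
predicted = slope * years + intercept
ss_res = np.sum((error_ppm - predicted) ** 2)                 # variation around the drift line
ss_tot = np.sum((error_ppm - error_ppm.mean()) ** 2)          # variation around a constant model
r_squared = 1.0 - ss_res / ss_tot

print(f"drift rate = {slope:.2f} ppm/year, R^2 = {r_squared:.3f}")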


It is interesting to visually ponder an attempt to characterize the zener drift in Figure 1 over a
relatively short period of time. Certain instances of data analysis over such time periods might
produce significantly different predictions of drift. This is evident, even via visual examination,
by observing only data from Jan 2001 to Jan 2002 which would result in a positive drift slope.
This illustrates the benefit of long calibration histories when attempting to predict drift in the
presence of random or pseudo-random variation, especially where the periodicity of these
variations is long.
However, a subjective decision must be made when determining how much historical data to
include in the regression. At some point, it may be reasonable to conclude that future behavior,
especially in the short term, is not significantly dependent on data from 10+ years ago. In
general, short-term predictions are better made by assessment of more recent history only, while
long-term predictions might be more accurate using the full comprehensive history. Special-cause
variation, such as the loss of power, can justify excluding data prior to the event. This
is a subjective process and heuristic judgment based on experience and knowledge of zener
behavior is helpful in determining how much data to include in the regression.
Zener references are not typically declared in-tolerance or out-of-tolerance by assessment against
a published accuracy specification, but rather to their predicted value and its assigned uncertainty
at a given time during the calibration interval. Zener references are also not typically adjusted,
although provision for electrical adjustment does exist. In lieu of physical/electrical adjustment,
the assigned/predicted value is mathematically adjusted or reassigned over time during
calibration. Algebraic corrections never interfere with the stability of a device nor are they
limited by the resolution of the adjusting mechanism. They do require the manual use of charted
values and uncertainties via reference to a Report of Test or calibration certificate.
By sheer numbers, the majority of items calibrated throughout the world are not predominately
high-level reference standards, but are of the more general variety of Test, Measurement, and
Diagnostic Equipment (TM&DE). Often times, the calibration history of such TM&DE contains
adjustment actions of both in-tolerance and out-of-tolerance instruments. The data shown in
Figure 2 represents actual data from the 50 V DC test point of a 4 digit handheld multimeter
(UUT). On the third calibration event, the UUT was found out-of-tolerance and was adjusted
back to nominal, resulting in zero observed error.
It is visually intuitive that this particular test point displays a high degree of monotonic drift with
very little random variation. In order to perform regression analysis, the magnitude and direction
of the adjustment action must be mathematically removed from the raw calibration data. The
resulting regression analysis is shown in Figure 3.


Figure 2. Calibration history representing as-found and as-left data

Figure 3. Regression of calibration data with adjustments mathematically removed (R² = 0.96)
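The preprocessing step of mathematically removing the adjustment can be sketched as follows (with hypothetical as-found/as-left values, not the data of Figures 2 and 3): each adjustment step, taken as the as-left value minus the as-found value, is accumulated and removed from subsequent as-found readings so that the underlying drift trajectory is continuous and suitable for regression.

import numpy as np

as_found = np.array([0.00, 0.04, 0.11, 0.01, 0.06])   # observed error at each calibration (hypothetical)
as_left  = np.array([0.00, 0.04, 0.00, 0.01, 0.06])   # error after adjustment (3rd cal adjusted to zero)

steps = as_left - as_found                               # size of each adjustment (0 where none was made)
offset = np.concatenate(([0.0], np.cumsum(steps)[:-1]))  # adjustment accumulated prior to each event
reconstructed = as_found - offset                        # drift-only history with adjustments removed

print(reconstructed)   # [0.   0.04 0.11 0.12 0.17] - the monotonic drift is now visible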
However, in many cases of general purpose TM&DE, detection of monotonic drift may be more
difficult to resolve due to domination by more random behavior or even special-cause variation
where instruments apparently step out-of-tolerance, rather than drift in a predictable manner.
Such an example, along with the regression analysis is shown in Figures 4 and 5. In such cases, a
model with random fluctuation superimposed on monotonic drift may not be the best model. One
of the other models proposed in NCSLI RP-1 may be more appropriate for such an instrument.


Figure 4. Calibration history representing predominately non-monotonic drift behavior

Figure 5. Regression with relatively low R², indicating significant behavior not explained by drift


4. Assumptions of the Drift Model


The Weiss-Castrup Drift model is investigated in this paper with the following assumptions:
1) The only two change mechanisms for instrument error are (a) linear monotonic drift and
(b) normally distributed random errors. No spontaneous special-cause step transitions or
other variation/ behavior is accommodated.
2) The periodicity and magnitude of the random fluctuations during measurement
(repeatability) is negligibly small compared to the periodicity and magnitude of the
random fluctuations over the calibration interval (reproducibility). Here, a single
measurement is simulated.
3) Tolerance specifications for the UUT are intended to represent approximately 95 %
containment probability. Drift from 0 % to 100 % of specification is modeled as attribute
bias. The normally distributed random component is then selected as the complement of the
allotted drift bias (μ), with the standard deviation (σ) chosen so that 95 % containment
probability is still provided for the total error. The higher the allotted drift (μ), the lower the
variation (σ).
4) The drift is constrained between 0 % and ~100 % of the stated specification.
5) The measurement uncertainty at the time of calibration is negligibly small, i.e., high Test
Uncertainty Ratio (TUR) or, equivalently, Measurement Capability Index (Cm).
Laboratory standards do not contribute significantly to the measurement uncertainty.
6) High precision physical and/or electrical adjustment provisions for the UUT are provided
which are capable of rendering an observed error of zero (eOBS = 0) after adjustment. This
may be a poor assumption for multi-range, multifunction instruments with many test
points. Algebraic (manually applied mathematical) corrections are equivalent.
7) Physical or electrical adjustments do not induce any secondary instabilities or otherwise
disturb the equilibrium or stress components of the instrument. No interaction between
adjustment controls for various test points, ranges, or functions is assumed.
8) Observed Out-Of-Tolerance conditions (>100 % of specification) require mandatory
adjustment. The adjustment threshold is constrained between 0 % to 100 % of
specification. However, adjustment thresholds >100 % are briefly investigated.
9) An adjustment action will always negate any previous attribute bias present at the end of
the previous period, but will also (insidiously) result in a present attribute bias equal to
the negative of the previous random error. No quantitative a-priori drift information is
assumed at the time of adjustment. Adjustment will overcompensate by the amount of
previous random error, as in the Deming funnel rule #2.
10) The adjustment threshold is always adhered to. If eOBS > adjustment threshold, an
adjustment will always be performed. If eOBS < adjustment threshold, no adjustment is
performed. Human behavioral/procedural error in adhering to the adjustment threshold is
not accommodated.
11) Symmetry is assumed and only positive drift is simulated with equal implications and
conclusions applicable to negative drift.
Assumptions #2, #3, and #9 above require further comment.

4.1 Assumption (#2): Periodicity and Magnitude of Variation


The Weiss paper and the Deming funnel addressed the periodicity by restricting adjustment
decisions to a single reading or observation. In the Weiss example, a single meter reading and
adjustment was performed every hour. Rather than decrease any attribute bias, the adjustments
resulted in increased random variation. Weiss concludes, "The presence and size of the bias
cannot be determined by a single reading; multiple data points are required... One must
observe enough data to characterize the variability of the meter readings to know which is the
correct strategy [adjust or not]."
Likewise, the model herein assumes that a single measurement is made during calibration or, if
repeated measurements are made and averaged, that the variability during calibration is
negligible with respect to the larger variation that occurs over the calibration interval. That is,
the random fluctuations occurring during the relatively short observation period of calibration
(repeatability) are not representative of, or do not capture, the full extent of the variation
exhibited over the longer calibration interval (reproducibility). This is somewhat akin to the
long-term dependency of 1/f noise. Conversely, if the periodicity and magnitude of fluctuations
are similar, then random fluctuations over the calibration interval are represented by those
encountered during the shorter measurement process. Such variations can then be largely negated
with averaging techniques during the measurement process, which should then be capable of
discerning actual attribute bias in the presence of random fluctuations. Under these
circumstances, adjustment could be warranted resulting in a genuine improvement in accuracy.

Figure 6. Measurement variation during calibration, compared with variation over cal interval.


Like the Weiss example and Deming funnel (rule #2), the model presented in this paper will
incorrectly assume the observed UUT error of +60 % shown in Figure 6 is attribute bias, even
under purely random behavior. Such an erroneous assumption will result in a calibration
adjustment magnitude of -60 % in a futile effort to correct for the observed random error. Like
Weiss and Deming, the correct assumption under purely random behavior is that the +60 % error
is common-cause and, if left undisturbed, will soon fluctuate and take on some other random
error represented by the UUT distribution. If this assumption is valid, the correct action would be
to do nothing and not adjust. The model presented here attempts to replicate the actions of the
calibrating technician, who does not have knowledge of the magnitudes of the individual
systematic attribute bias vs. random behavior; adjustments are made based only on the observed
error at the time of calibration, which is composed of both bias from drift and random error.
But this decision can only be made confidently with a-priori knowledge of the UUT error
distribution over the course of the calibration interval. In many cases, this distribution
is not readily available and discretionary calibration adjustments are made with the assumption
that all of the observed error is an actual attribute bias which will remain (or possibly grow)
unless an adjustment is performed. In an ideal case, the calibration technician would be able to
discern a short-term random error from an actual long-term attribute bias through examination
of historical data. At the time of calibration however, the two types of errors are often
inextricably combined into the observed error, whether obtained from a single reading or
several averaged measurements over a short period of time. The attribute bias is somewhat
hidden in the presence of random error. This is the behavior that is modeled herein.
4.2 Assumption (#3): 95 % Containment Specifications; Selection of Drift vs. Random
This is perhaps the most significant and sweeping assumption used in the model presented here.
The rationale used herein assumes that specifications are generally intended to adequately
accommodate or contain the majority of errors that an instrument might exhibit, with relatively
high confidence (e.g. 95 %). As such, the magnitudes of drift and random variability are selected
as complementary to one another and modeled under this assumption. This greatly restricts the
domain of possible instrument behavior investigated here. Instruments with drift and random
variation, which are both far better (lower) than their specifications might imply, are not modeled
here. Rationale for the assumption and selection of the particular domain of instrument behavior
investigated in this paper is provided here.
As stated in Section 5.4 of NASA HDBK-8739.19-2 [18], In general, manufacturer
specifications are intended to convey tolerance limits that are expected to contain a given
performance parameter or attribute with some level of confidence under baseline conditions
Performance parameters and attributes such as nonlinearity, repeatability, hysteresis,
resolution, noise, thermal stability and zero shift are considered to be random variables that
follow probability distributions that relate the frequency of occurrence of values to the values
themselves. Therefore, the establishment of tolerance limits should be tied directly to the
probability that a performance parameter or attribute will lie within these limits
The selection of applicable probability distributions depends on the individual performance
parameter or attribute and are often determined from test data obtained for a sample of articles
or items selected from the production population. The sample statistics are used to infer
information about the underlying parameter population distribution for the produced items. This
population distribution represents the item to item variation of the given parameter. The
performance parameter or attribute of an individual item may vary from the population mean.
However, the majority of the produced items should have parameter mean values that are very
close to the population mean. Accordingly, a central tendency exists that can be described by the
normal distribution
Baseline performance specifications are often established from data obtained from the testing of
a sample of items selected from the production population. Since the test results are applied to
the entire population of produced items, the tolerance limits should be established to ensure that
a large percentage of the items within the population will perform as specified performance
parameter distributions are established by testing a selected sample of the production
population. Since the test results are applied to the entire population of a given parameter, limits
are developed to ensure that a large percentage of the population will perform as specified.
Consequently, the parameter specifications are confidence limits with associated confidence
levels.
Accuracy specifications are of little benefit if they cannot be relied upon with reasonably high
confidence. Manufacturers sometimes publish specifications at both 95 % and 99 % confidence
levels [Ref 19]. After many calibration cycles, EOPR is then an empirical estimate of that
confidence; i.e. EOPR provides a measure or assessment of the probability for an instrument to
comply with its specifications at the end of its calibration interval.
However, the intent and conditions of specifications and any assumed confidence are subject to a
certain amount of interpretation and inference. Is the confidence level specifically stated or is it
implied? Does the confidence level of the specification apply to a single test point, or to a single
instrument, or to a population of similar instruments?
For example, the published absolute uncertainty specification at a 95 % confidence level for a
Fluke 8508A DMM, at 20 VDC, is 3.2 ppm [Ref 19]. The same 20 VDC point has a published
uncertainty of 4.25 ppm expressed at a 99 % confidence level. As manufactured and if properly
used, it might be reasonable for the end-user of this DMM to apply the stated specification at this
particular 20 VDC test point and assume the stated confidence level applies.
However, it can be argued that for multifunction instruments with multiple test points, the actual
confidence level of any individual test point must be much greater than 95 % or even 99 %
confidence if the instrument as-a-whole is expected to meet its specifications with the stated
confidence.
As Deaver has noted [20], "...each Fluke Model 5520A Multiproduct Calibrator is tested at 552
points on the production line prior to shipment. If each of the points has a 95% probability of
being found in tolerance, there would only be a 0.95^552 = 0.000000000[0]51% chance of finding
all the points within the specification limits if the points are independent! Even if we estimate
100 independent points (about 2 per range for each function), we would still have only a 0.95^100
= 0.6% chance of being able to ship the product."
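The arithmetic in this quotation is easily checked (assuming independent test points, each with a 95 % in-tolerance probability):

print(0.95 ** 552 * 100)   # ~5e-11 % chance that all 552 points are simultaneously in tolerance
print(0.95 ** 100 * 100)   # ~0.6 % chance that all 100 points are simultaneously in tolerance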

Similar statements have been published by Dobbert [Ref 21, 21a]. A common assumption is
that product specifications describe 95% of the population of product items [emphasis added].
From the mean, μ, and standard deviation, σ, an interval of [μ − 2σ, μ + 2σ] contains
approximately 95% of the population. However, when manufacturers set product specifications,
the test line limit is often set wider than 2σ from the population mean...
For choosing the tolerance interval probability, a generally accepted minimum value is 95%.
However, manufacturers may choose a probability other than 95% for different reasons.
Consider again a multi-parameter product. Manufacturers wish to have high yields for the entire
product so that the yield considering all parameters meets the respective test line limits. If the
product parameters are statistically independent, the overall yield, in this case, is the product of
the probability for each parameter. For a product with just three independent parameters, each
with a test limit intended to give 95% probability, the product would only have a (0.95)^3 or
85.7 % chance of meeting all test line limits, which is perhaps unacceptable to the manufacturer.
For this reason, manufacturers select tolerance interval probabilities greater than 95% so that
the overall probability is acceptable.
When discussing drift, Dobbert also notes, Stress due to environmental change, as well as
everyday use, transport, aging and other factors may induce small changes in performance that
accumulate over time. In other words, products drift. The effect of drift is that from the time of
manufacture to the end of the initial calibration interval, it is likely that performance has shifted.
...a population of product items also experiences a shift in the mean, a change in the standard
deviation, or both, due to the mechanisms associated with drift...
To ensure products meet specification over the initial calibration interval, manufacturers may
include an additional guard band between the test line limit and the specification... In the
simplest case, the total guard band between the test line limit and the specifications is the sum of
the individual guard band components for environmental factors, drift, measurement uncertainty
and any other required component. For example, Specification = Test Line Limit + GB_environment + GB_drift + GB_uncertainty
gives what is often the initial specification for a product. For the final specification,
manufacturers must consider manufacturing costs, market demands and competing product
performance.
When discussing manufacturers' specifications propagated into uncertainty analyses, Dobbert
additionally notes, The GUM provides guidance for evaluation of standard uncertainty and
specifically includes manufacturers specifications as a source of information for Type-B
estimate To evaluate a Type-B uncertainty, the GUM gives specific advice when an
uncertainty is quoted at a given level of confidence. In this instance, an assumption can be made
that a Gaussian distribution was used to determine the quoted uncertainty. The standard
uncertainty can then be determined by dividing by the appropriate factor given the stated level of
confidence. Various manufacturers state a level of confidence for product specifications and
applying this GUM advice to product specifications quoted at a level of confidence is common
and accepted by various accreditation bodies.


The assumption, used in the model investigated by this paper, is that a specification represents
95 % containment probability of errors for a given test point; thus, the magnitude and proportion
of drift and random components are modeled accordingly (see Section 5). This may be a
significant assumption and highly conservative, especially where actual instrument performance
at a given test point exhibits systematic drift (bias) and random error components much lower
than represented by the specifications. For example, the domain of performance for instruments
displaying drift (μ) of only 10 % of specification per interval and, at the same time, a random
component (σ) of only 20 % of specification is not modeled here.
However, a great many instruments may be well capable of performing at such levels, i.e.
considerably better than their specifications would imply. This is especially true if one assumes
that the manufacturer has built significant margins or guardbands into the specifications and/or
that the confidence level of specifications is intended to represent an entire population of
instruments, or one instrument as-a-whole, rather than a single test point. Investigations of such
domains of behavior, and the effect on EOPR of various adjustment thresholds under such
improved instrument performance, may be highly insightful and are deferred to future
explorations². Moreover, models where random variation (σ) itself increases with time (such as
random-walk models) would be useful, with or without a drift component. Such a model, even in
the absence of monotonic drift, exhibits a time-dependent mechanism for transitioning to OOT³.
4.3 Assumption (#9): Mandatory Adjustment of OOT Conditions is Required
In practice, calibration laboratories, which are charged with verification as part of the calibration
process, are required to perform an adjustment of the UUT if it exceeds the allowable tolerance(s)
(>100 %) defined by the agreed-upon specifications. It is not generally acceptable to return an
item to the end-user as calibrated, while exhibiting an observed OOT condition.
However, in a Weiss or Deming model where fluctuations are purely random, such mandatory
adjustment would appear to be the incorrect course of action. The OOT condition, like the
in-tolerance condition, should not be adjusted; it should be allowed to remain with the assumption
that it will soon decrease and take on some other random value which will likely be contained
within the specification limits. In this
regard, there is nothing special about the OOT condition. It is simply part of the normal
common-cause random variation that will inevitably, albeit rather infrequently (e.g. 5 %), fall
outside of specification limits which are intended to represent 95 % confidence or other
containment probability. Appendix G of NCSLI RP-1 perhaps best describes this as a logical
predicament when discussing non-adjustment of items as follows:
If we can convince ourselves that adjustment of in-tolerance attributes should not be made, how
then to convince ourselves that adjustment of out-of-tolerance attributes is somehow beneficial?
For instance, if we conclude that attribute fluctuations are random, what is the point of adjusting
attributes at all? What is special about attribute values that cross over a completely arbitrary
line called a tolerance limit? Does traversing this line transform them into variables that can be
controlled systematically? Obviously not.
More on the topic of non-adjustment of OOT conditions is presented later in Section 6.
² The author thanks Jonathan Harben of Keysight Technologies for these astute suggestions.
³ The author thanks Dr. Howard Castrup of Integrated Sciences Group for this valuable observation.


The model presented herein concedes to the conventional industry practice which mandates
adjustment of items which are observed to be out-of-tolerance. Where the observed error is
predominately a long-term attribute bias, resulting from systematic monotonic drift or
otherwise, adjustment is a beneficial action. Such attribute bias is likely to remain or possibly
grow larger if left unadjusted. However, where the observed error resulted predominately from a
short-term random event, adjustment will be the incorrect decision. Like the calibration
technician, this model assumes (correctly or incorrectly) that all observed as-received errors
represent systematic attribute bias; adjustment actions will be implemented according to the
adjustment threshold parameter set for the model (0 % to 100 % of specification). In this sense,
the model feigns ignorance of the constituent proportion of random to attribute bias during
adjustment actions but, in actuality, is privy to the amount of attribute bias at all times in the
simulation. For investigational purposes, adjustment thresholds >100 % of specification are
briefly discussed, although they are believed unlikely to find application in most calibration
laboratories.
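Before proceeding to the modeling details, the assumptions above can be collected into a compact Monte Carlo sketch (an illustrative implementation offered by way of example, not the simulation code actually used to produce the results in this paper; the drift and σ magnitudes follow Table 1 in Section 5):

import numpy as np

def simulate_eopr(drift, sigma, adjust_threshold, n_intervals=200, n_items=5000, seed=0):
    # EOPR estimate: fraction of as-found errors within +/-100 % of specification,
    # with all errors expressed in percent of specification.
    rng = np.random.default_rng(seed)
    bias = np.zeros(n_items)                    # attribute bias carried into each interval
    in_tol = 0
    for _ in range(n_intervals):
        bias += drift                           # assumption 1: linear monotonic drift over the interval
        rand = rng.normal(0.0, sigma, n_items)  # assumption 1: normally distributed random error
        observed = bias + rand                  # the technician sees only the combined error
        in_tol += np.count_nonzero(np.abs(observed) <= 100.0)
        adjust = np.abs(observed) > adjust_threshold      # assumptions 8 and 10
        # Assumption 9: adjustment negates the bias but leaves -rand as the new bias.
        bias = np.where(adjust, -rand, bias)
    return in_tol / (n_intervals * n_items)

# Example: 26 % drift per interval with the complementary sigma = 44.386 % (Table 1),
# compared for adjust-always (0 %), a 70 % threshold, and adjust-only-if-OOT (100 %).
for threshold in (0.0, 70.0, 100.0):
    print(threshold, round(simulate_eopr(26.0, 44.386, threshold), 3))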
5. Modeling and Selection of Magnitude for Drift and Random Variation
The illustration in Figure 7 represents the general concept of monotonic drift superimposed on
constant random variation.

Figure 7. Monotonic drift superimposed on constant random variation


LEFT: The random variation has been superimposed on no drift at all (0 %) and the
specification adequately contains 95 % of the random errors.
CENTER: The random variation has been superimposed on drift in the amount of 50 % of
specification. The mean of this distribution at the end of the calibration interval is not zero,
but is equal to the amount of drift accumulated over the calibration interval (50 %). Thus, a
significant portion (16.4 %) of errors will exceed the upper specification limit and only 0.2 %
will exceed the lower specification limit. Only 83.5 % will be in-tolerance (EOPR); these
percentages are verified in the sketch following this list.
RIGHT: The random variation is superimposed on drift in the amount of 100 % of
specification. The mean of the distribution is shifted to 100 % of specification, resulting in
50 % of the errors exceeding the upper specification limit when received for calibration. This
is generally an unacceptable situation, as End Of Period Reliability of 50 % is below most
industry accepted reliability targets. See Section 7 for examples of EOPR objectives.
Figure 7 represented random variation as a normal probability distribution with constant width
(σ = constant). However, if the specification limits are intended to provide a containment
probability of 95 % as discussed in Section 4.2, then any allowable drift must result in a
commensurate reduction in the amount of allowable random variation in order to still provide a
95 % confidence. In the model used herein, the amount of drift is first selected as a percentage
(0 % to 100 %) of the allowable specification over one interval. This will result in a systematic
drift-induced attribute bias at the end of one interval equal to the amount of specified drift. OOT
incidents will tend towards the direction of drift; e.g. for a positive drift allowance, OOT
conditions will predominately be found exceeding the upper specification limit in only one tail of
the distribution. The resulting drift, after one interval, forms the mean (μ) of the normally
distributed random component.
Since the intent of the accuracy specification is assumed to represent a 95 % containment
probability for the error, the remaining portion of the specification is then modeled as a normally
distributed random component with a standard deviation (σ) selected to still provide 95 %
containment (see Table 1 and Figure 8). This complementary aspect of these two components is
necessary to provide the desired containment probability. As discussed in Section 4.2,
specifications are often, directly or implied, provided by the OEM with an allowance for drift
designed into them and provided at a relatively high confidence level. This is the basis for the
choice of magnitudes for the model used here. As the drift component dominates and approaches
the 100 % specification limit, the random component approaches zero. That is, as the systematic
drift (μ) increases, the random variation (σ) decreases, as shown in Figure 8.

Figure 8. Positive drift superimposed on complementary random variation


To maintain 95 % EOPR, a perfect adjustment would need to be made at the end of each
calibration interval (in-tolerance or not). This is necessary to reduce the attribute bias (due to
drift or otherwise) to zero. Only if this ideal adjustment always occurs at the end of each
calibration interval would 95 % EOPR be achievable in this model. However, such adjustment
will not be possible in this model, due to the nature of the random variation precluding an ideal
adjustment. Thus, EOPR will be less than 95 % for adjustment thresholds between 0 % and 100
% of specification.
Table 1. Magnitude of drift (μ) and random (σ) components, modeled to maintain 95 % in-tolerance confidence.
(μ) is given as a percentage of the specification per interval; (σ) is given as a percentage of the specification.
Columns, repeated in four groups across the page: drift mean (μ), random S.D. (σ), and ratio (μ/σ); the first two
groups also list the left-tail and right-tail OOT probabilities.

 Drift   Random   Ratio   L-tail   R-tail   |  Drift   Random   Ratio   L-tail   R-tail   |  Drift   Random   Ratio  |  Drift   Random   Ratio
  (μ)      (σ)    (μ/σ)   OOT Pr.  OOT Pr.  |   (μ)      (σ)    (μ/σ)   OOT Pr.  OOT Pr.  |   (μ)      (σ)    (μ/σ)  |   (μ)      (σ)    (μ/σ)
  0 %   51.021 %  0.000   2.500 %  2.500 %  |  26 %   44.386 %  0.586   0.226 %  4.774 %  |  51 %   29.790 %  1.71   |  76 %   14.591 %  5.21
  1 %   51.011 %  0.020   2.385 %  2.614 %  |  27 %   43.881 %  0.615   0.190 %  4.810 %  |  52 %   29.181 %  1.78   |  77 %   13.983 %  5.51
  2 %   50.981 %  0.039   2.271 %  2.729 %  |  28 %   43.363 %  0.646   0.158 %  4.842 %  |  53 %   28.573 %  1.85   |  78 %   13.375 %  5.83
  3 %   50.933 %  0.059   2.158 %  2.843 %  |  29 %   42.834 %  0.677   0.130 %  4.870 %  |  54 %   27.966 %  1.93   |  79 %   12.767 %  6.19
  4 %   50.864 %  0.079   2.044 %  2.956 %  |  30 %   42.291 %  0.709   0.106 %  4.894 %  |  55 %   27.358 %  2.01   |  80 %   12.159 %  6.58
  5 %   50.776 %  0.098   1.933 %  3.068 %  |  31 %   41.739 %  0.743   0.085 %  4.915 %  |  56 %   26.749 %  2.09   |  81 %   11.551 %  7.01
  6 %   50.668 %  0.118   1.822 %  3.178 %  |  32 %   41.177 %  0.777   0.067 %  4.933 %  |  57 %   26.142 %  2.18   |  82 %   10.943 %  7.49
  7 %   50.539 %  0.139   1.712 %  3.287 %  |  33 %   40.606 %  0.813   0.053 %  4.947 %  |  58 %   25.534 %  2.27   |  83 %   10.335 %  8.03
  8 %   50.392 %  0.159   1.605 %  3.395 %  |  34 %   40.029 %  0.849   0.041 %  4.960 %  |  59 %   24.926 %  2.37   |  84 %    9.727 %  8.64
  9 %   50.224 %  0.179   1.499 %  3.500 %  |  35 %   39.444 %  0.887   0.031 %  4.969 %  |  60 %   24.318 %  2.47   |  85 %    9.119 %  9.32
 10 %   50.038 %  0.200   1.396 %  3.604 %  |  36 %   38.856 %  0.926   0.023 %  4.977 %  |  61 %   23.710 %  2.57   |  86 %    8.511 %  10.1
 11 %   49.831 %  0.221   1.296 %  3.705 %  |  37 %   38.263 %  0.967   0.017 %  4.983 %  |  62 %   23.102 %  2.68   |  87 %    7.904 %  11.0
 12 %   49.603 %  0.242   1.198 %  3.803 %  |  38 %   37.666 %  1.01    0.012 %  4.988 %  |  63 %   22.494 %  2.80   |  88 %    7.296 %  12.1
 13 %   49.356 %  0.263   1.103 %  3.898 %  |  39 %   37.066 %  1.05    0.009 %  4.991 %  |  64 %   21.886 %  2.92   |  89 %    6.687 %  13.3
 14 %   49.088 %  0.285   1.011 %  3.989 %  |  40 %   36.464 %  1.10    0.006 %  4.994 %  |  65 %   21.278 %  3.05   |  90 %    6.080 %  14.8
 15 %   48.801 %  0.307   0.922 %  4.078 %  |  41 %   35.861 %  1.14    0.004 %  4.996 %  |  66 %   20.670 %  3.19   |  91 %    5.472 %  16.6
 16 %   48.495 %  0.330   0.838 %  4.163 %  |  42 %   35.256 %  1.19    0.003 %  4.998 %  |  67 %   20.062 %  3.34   |  92 %    4.864 %  18.9
 17 %   48.168 %  0.353   0.757 %  4.243 %  |  43 %   34.650 %  1.24    0.002 %  4.998 %  |  68 %   19.454 %  3.50   |  93 %    4.256 %  21.9
 18 %   47.821 %  0.376   0.680 %  4.320 %  |  44 %   34.043 %  1.29    0.001 %  4.999 %  |  69 %   18.846 %  3.66   |  94 %    3.648 %  25.8
 19 %   47.456 %  0.400   0.608 %  4.393 %  |  45 %   33.435 %  1.35    0.001 %  4.999 %  |  70 %   18.238 %  3.84   |  95 %    3.040 %  31.3
 20 %   47.071 %  0.425   0.540 %  4.461 %  |  46 %   32.828 %  1.40    0.000 %  4.999 %  |  71 %   17.630 %  4.03   |  96 %    2.432 %  39.5
 21 %   46.666 %  0.450   0.476 %  4.524 %  |  47 %   32.220 %  1.46    0.000 %  4.999 %  |  72 %   17.022 %  4.23   |  97 %    1.824 %  53.2
 22 %   46.244 %  0.476   0.417 %  4.583 %  |  48 %   31.613 %  1.52    0.000 %  5.000 %  |  73 %   16.415 %  4.45   |  98 %    1.216 %  80.6
 23 %   45.805 %  0.502   0.362 %  4.638 %  |  49 %   31.005 %  1.58    0.000 %  5.000 %  |  74 %   15.807 %  4.68   |  99 %    0.608 %  163
 24 %   45.347 %  0.529   0.312 %  4.687 %  |  50 %   30.398 %  1.64    0.000 %  5.000 %  |  75 %   15.199 %  4.93   | 100 %    0.000 %  N/A
 25 %   44.874 %  0.557   0.267 %  4.733 %  |
Figure 9. Monte Carlo simulation model for one calibration interval
(Diagram summary: the as-received observed error eOBS is compared with the tolerance (In-Tol?) and with the
adjustment threshold (Adjust?); when an adjustment is performed it equals -eOBS, leaving an as-left observed
error of zero; during the calibration interval the previous cumulative systematic bias eBIAS, the systematic
monotonic drift eDRIFT, and a normally distributed random component eRAND accumulate, so that the
end-of-period error is eEOP = eBIAS + eDRIFT + eRAND.)

eOBS = Error Observed for the UUT, as-received. It is equal to the End-Of-Period error for the
previous calibration interval (eEOP (i-1) ). Only a portion of eOBS is due to systematic error (eBIAS (i-1) +
eDRIFT (i-1)). However, any adjustments are performed equal-and-opposite to the whole of eOBS,
which includes random error (eRAND (i-1)) in addition to systematic error (eBIAS (i-1) + eDRIFT (i-1)).

eBIAS = UUT attribute bias, as-left. If no adjustment has been made, eBIAS remains the same as the
sum of the systematic errors at the end of the previous calibration interval (eBIAS (i-1) + eDRIFT (i-1)). If
an adjustment is made, eBIAS is equal to the negative of the previous random error (-eRAND (i-1)). After
adjustment, eBIAS is zero only if the random error during the previous cal interval (eRAND (i-1)) was
zero (unlikely). Adjustment actions will always negate previously accumulated attribute bias,
but will also result in attribute bias of their own, due to an overcompensated adjustment.

eDRIFT = Error of UUT attributable to monotonic drift. If no adjustment is made, this systematic
drift error carries over or accumulates from one calibration interval to the next. For the model,
eDRIFT is specified as a percentage of the allowable tolerance or accuracy specification. The
remainder of the specification is then allocated to eRAND as (100 % - Drift %).

eRAND = Error of UUT attributable to random behavior. A random number generator is used to
select eRAND from a normal (Gaussian) distribution. Ideally, no adjustment should be made to
compensate for this component. This is common-cause variation with an assumed period
significantly longer than the observation period during calibration. If all variation is random,
adjusting is equivalent to tampering with a system which may otherwise be in a state of
statistical control. It is analogous to moving the funnel in the Deming experiment.

eEOP = Error of UUT at End of Period (includes attribute bias, plus drift, plus random error).
6. Results
The results in Figures 10A and 10B were rendered via the Monte Carlo method to visually
investigate aspects of the Weiss-Castrup drift model with regard to adjustment thresholds. The
model in Figure 9 is repeated for 100 000 iterations and the number of Out-Of-Tolerance
instances for eOBS is tallied over the 10^5 cycles. The End-of-Period Reliability is then computed as
EOPR = (10^5 - OOTs) / 10^5. This process is repeated ten times with the average taken to arrive at
a final simulated EOPR output, applicable to a specifically chosen pair of values in the model,
i.e. (1) the amount of monotonic drift and (2) the adjustment threshold. A 101 x 101 matrix of
EOPR values is then generated by looping the process in +1 % increments from 0 % to 100 %
for both the monotonic drift variable and the adjustment threshold variable. In total, ~10^10 Monte
Carlo iterations are used in the generation of the matrix. This requires considerable
computational brute force and consumed approximately 43 hours of CPU time running under
MS Windows 7 in Excel 2010 using an Intel Core i5-4300 CPU clocked at 2.6 GHz. See
Appendix B for a discussion of using Excel for Monte Carlo methods.
The resulting multivariate matrix can then be plotted as a three dimensional surface plot (Figures
10A & 10B), with the EOPR values displayed on the vertical z-axis. The x-axis represents the
monotonic drift rate and the y-axis represents the adjustment threshold, from 0 % to ~100 %
each. This provides insight into the effects that these variables impart to EOPR, which is
arguably the most important quality metric for many calibration and metrology organizations.
Other important quality metrics, such as Test Uncertainty Ratio (TUR) and the Probability of
False Accept (PFA), are inextricably interrelated to the observed EOPR [22, 23].

Figure 10A. 3D surface plot of EOPR as a function of adjustment threshold and drift

Figure 10B. 3D surface plot of EOPR as a function of adjustment threshold and drift
It is important to bear in mind the nature of the x-axis, representing drift in Figures 10A and 10B.
As the amount of drift increases, the random behavior decreases, as assumed by this
particular model (see Table 1). Other modeling can be performed with different parametric
assumptions, e.g. where the random variation is held constant (or grows larger) in the presence of
increasing drift. Still other assumptions, such as zero drift and increasing random variation, e.g.
random-walk models, could be modeled. Such investigations would provide additional insight.
It should also be noted that here, the x-axis merely approaches 100 % drift (zero random error).
When drift is exactly 100 % of specification with zero random error, all adjustment thresholds
≤ 100 % result in 100 % EOPR. In that case, adjustments are always performed and they are
always perfect due to the absence of random error (assuming infinite TUR; see assumption #5
in Section 4).
Many implications follow from the resulting model in Figures 10A and 10B for the stated
assumptions. Perhaps the most significant commonality in all instances is that, as the calibration
adjustment threshold increases from 0 % to 100 % of specification, the EOPR remains constant
or decreases in all cases; it never increases. This is further illustrated in Figure 11.

Figure 11. EOPR as a function of adjustment threshold for various levels of drift
In Figure 11, note that for the case of purely random variation with zero drift (green line), the
EOPR is constant at 83.4 %, just as the Weiss and Deming model would predict when
adjustments are always made (i.e. adjustment threshold of 0 % of specification). However, it is
interesting to note that this 83.4 % EOPR does not improve as the adjustment threshold is
increased from 0 % (always adjust) towards 100 % of specification (adjust less frequently).
Why does an increase in EOPR (reduction in variability) not result, in this purely random case,
as the adjustment threshold increases from 0 % to 100 % (i.e. less frequent adjustments)? The
answer to this question can be elucidated if the scale of the adjustment threshold and y-axis are
extended beyond the 100 % of specification limit (OOT point). With the model constrained to a
maximum of 100 % adjustment threshold in the purely random case, adjustments will still be
made for all observed OOT conditions. Even though these adjustments occur less frequently than
the always-adjust scenario (0 % adjustment threshold), the magnitude of these less frequent
adjustments or tampering is always quite large. For purely random systems, these large but less
frequent adjustments for observed OOT conditions ultimately result in the same outcome as the
Weiss and Deming models predict; i.e. they lead to the same increased variability (the variance
doubles from σ² to 2σ², i.e. σ increases to √2σ) and resulting lower EOPR (83.4 %), just as if
adjustment or tampering was performed every time.
If the adjustment threshold is increased to 500 % of specification (or more), and the simulation is
run again, a decrease in variability (from √2σ to σ) and resulting increase in EOPR (from 83.4 %
to 95 %) is indeed observed. However, the transition region where this phenomenon occurs is not
well-behaved (see Figure 12). That is, as the adjustment threshold is raised above 100 % of
specification, fewer and fewer adjustments are ever made. The probability of adjustment
becomes exceedingly low. However, when one of these very rare events does occur, triggering
an adjustment (after many thousands of iterations of the Monte Carlo simulation), the effect is
quite significant. Since it was presumed to be a random event, no adjustment should have been
made (even at 150 %, 200 %, 300 % of specification, or more). Adjusting such a large random
error imparts an equally large attribute bias, opposite in sign.

Figure 12. Monte Carlo modeled behavior for random errors w/ adjust thresholds >100 % of spec
If the Monte Carlo simulations are extended to include adjustment thresholds far above 100 % of
specification (>OOT), the EOPR behavior becomes somewhat erratic between 150 % and 270 %
of specification. It ultimately settles at the 95 % EOPR, just as if no adjustments were ever made,
because no adjustments are essentially ever made when the adjustment threshold is so large. The
repeatability of the Monte Carlo process is also poor in this transition region (even with 10^6
iterations) because the results of the simulation are highly sensitive to very improbable events.
After the adjustment threshold extends beyond ~270 % of specification (~5.5σ), adjustment
actions become so rare as to approach the never adjust scenario of the Deming funnel (rule #1)
where the variation is lowest. Under these circumstances, the EOPR settles at the original 95 %
containment probability of the purely random variation with respect to the ±1.96σ specification
limits.
This scenario will likely find little application in calibration laboratories. One would have to be
willing to not adjust instruments with observed errors >>100 % of specification (highly OOT).
The rationale for such a decision would be to attribute all errors (regardless of how large) to purely
random events that would not remain if simply left alone and not adjusted. In reality, such large
errors may be much more likely to be true attribute bias resulting from special-cause variation
such as misuse, over-ranging, rough handling, etc. Analysis of historical data is of great benefit
when attempting to characterize such errors.
7. EOPR Reliability Targets


The use of EOPR as a quality metric for calibrated equipment is of great importance. EOPR
targets are analogous to an Acceptable Quality Level (AQL) in manufacturing environments.
Both metrics speak to the percentage of items that comply with their stated specifications,
although AQLs are expressed as the complement of this (i.e. tolerable percent defective, not to
be confused with LTPD). Calibration intervals are often adjusted in an effort to achieve these
goals. Target EOPR levels are often proprietary for commercial and private industry. However,
it is insightful to review some EOPR objectives for calibrated equipment in military and
aerospace organizations. A summary of such targets is provided here.
TARGET EOPR LEVELS

NASA Kennedy Space Center (KNPR 8730.1, Rev. Basic-1; 2003 to 2009, Obsolete)
At KSC, calibration intervals are adjusted to achieve an EOPR range of 0.85 to 0.95. [24]

U.S. Navy (OPNAV 3960.16A; 2005)


CNO policy requires USN/USMC to: (o) Establish an objective end of period reliability goal for
TMDE equal to or greater than 85 percent, with the threshold reliability in no case to be lower
than 72 percent. [25]

U.S. Navy (Albright, J. Thesis; 1997)


intervals are based on End-Of-Period (EOP) operational reliability targets of 72% for non-critical General Purpose Test Equipment (GPTE) and 85% for critical Special Purpose Test
Equipment (SPTE). [26]

U.S. Air Force (TO 00-20-14; 2011)


The Air Force calibration interval is the period of time over which the equipment shall perform
its mission or function with a statistically derived end-of-period reliability (shall be within tolerance)
of 85% or better. [27]

U.S. Army (GAO B-160682, LCD-77-427; 1977, Obsolete)


the Army decided to follow the Air Force's and Navy's lead in establishing an 85-percent end-of-period reliability requirement. However, the Army has adopted a new statistical model and
changed its policy to require 75-percent end-of-period reliability. [28]

U.S. Army (AR 750-43; 2014, Current)


On average, 90 percent of items will be in tolerance over the calibration interval, and 81 percent
will be in tolerance at the end of the interval. [29]

The NCSL International Benchmarking Survey (LM-5) provides additional information on


EOPR targets, termed Average % In-Tolerance Target [55]. In the survey, statistics were
aggregated from 357 national and international respondents polled in 2007. Demographics
included aerospace, military & defense, automotive, biomedical/pharmaceutical,
chemical/process, electronics, government, healthcare, M&TE manufacturers, medical
equipment, military, nuclear/energy, service industry, universities and R&D, and other. This
NCSLI survey found:

4 % of respondents employ EOPR targets <85 %


19 % of respondents employ EOPR targets between 85 % and 90 %
25 % of respondents employ EOPR targets between 91 % and 95 %
52 % of respondents employ EOPR targets >95 %

8. Non-Adjustable Instruments
It should be noted that, in the presence of any amount of monotonic drift regardless of how
small, an adjustment will eventually have to be made or the attribute bias will ultimately exceed
the allowable specification. Indeed, the very practice of shortening an interval to increase EOPR
is somewhat predicated on some form of time-dependent mechanism increasing the magnitude of
possible errors, along with the ability to adjust (reduce) the attribute bias to or near zero.
For non-adjustable instruments, EOPR cannot generally be increased by shortening a calibration
interval via the same mechanism applicable to adjustable instruments. However, shortening the
calibration interval for non-adjustable instruments can still be beneficial in two ways.
1) An increase in EOPR can still result from shortening the calibration interval for non-adjustable
instruments which exhibit a relatively small time-dependent mechanism for
transitioning to an OOT condition (e.g. low drift). This is true because more in-tolerance
calibrations will be performed prior to the occurrence of an OOT condition. Once a non-adjustable
instrument incurs its first OOT condition, it cannot be adjusted back into
tolerance and has effectively reached the end of its service life, at which point EOPR =
(#Calibrations - 1) / (#Calibrations); for example, an instrument first found OOT at its tenth
calibration yields EOPR = 9/10 = 90 %. The shorter the interval, the more in-tolerance
calibrations will have been performed and the higher the EOPR will be. After the first OOT
event, the instrument must then be retired from service or the allowable tolerance must be
increased with consent from the end-user or charted values must be manually employed
via a Report of Test or Calibration Certificate. Such action should only be taken if no impact
will result to the application or process for which the instrument is employed.
2) Organizational benefits, other than increased EOPR, can also be realized through shortening
of calibration intervals for non-adjustable instruments. These benefits do not manifest as an
increase in EOPR, but rather in a reduction of the exposure to possible consequences
associated with an out-of-tolerance condition. For example, a working-standard resistor
(calibrated to a tolerance) may not be adjustable. An out-of-tolerance condition may
eventually arise from drift or even special-cause variation (over-power/voltage, mechanical
shock/damage, etc.). Shortening the calibration interval will provide no direct benefit to
EOPR via a reduction in errors through adjustment. However, since any OOT condition will
result in an impact assessment (reverse traceability) for all instruments calibrated by this
OOT resistor, a shorter calibration interval will reduce the number of possible impact
assessments and the risk exposure to product or process, providing benefits of a different nature.
9. Conclusions
Discretionary adjustment during calibration of in-tolerance equipment is not mandated by
national and international calibration standards ANSI/Z540.3 and ISO-17025, nor is adjustment
contained within the VIM definition of calibration. A model has been used here in an attempt to
describe the effect of various discretionary adjustment thresholds on in-tolerance instruments,
assuming a specific behavioral mode called the Weiss-Castrup drift model and under very
specific assumptions. These assumptions may not hold for many items of TM&DE. Other
alternative assumptions, where the domain of drift and random behavior simultaneously
comprise only a small percentage of the associated specification, may yield significantly
different results and are worthy of further investigation.
Using Monte Carlo methods, the effect of various discretionary adjustment thresholds on End Of
Period Reliability (EOPR) has been investigated for in-tolerance instruments under these specific
conditions. For the model and assumptions stated, it is shown that discretionary adjustments of
in-tolerance instruments can be beneficial in the presence of monotonic drift superimposed on
random variation. Under these conditions, the non-adjustment benefits of reduced variation
(increased EOPR), posed by the Weiss model and Deming funnel model, do not appear to
manifest between the 0 % and 100 % of specification adjustment thresholds. As the calibration
adjustment threshold increases from 0 % to 100 % of specification, the EOPR remains constant
or decreases in all cases; it never increases. Only after the adjustment threshold far exceeds
100 % of specification and effectively approaches the never-adjust scenario, are these benefits
realized for purely random behavior. Never adjusting items with any significant amount of
monotonic drift is not a viable option, as these instruments will rather quickly transition to an
OOT condition resulting from a true attribute bias due to drift.
The assumptions of the model may be idealized and unrealistic in the empirical world. Moreover,
it may be unlikely that the behavior of any instrument would be entirely restricted to only the
two change mechanisms accommodated by this model or the domain of magnitudes and/or
proportions of drift and random behavior restricted to the values modeled here. Many general
purpose TM&DE instruments may perform considerably better than their specifications would
imply. They may also be impacted by other behavioral characteristics and special cause events,
hindering the use of this model and of linear regression as a prediction technique.
Random walk behavior, where the magnitude of the random variation (σ) itself increases with
time, may be more realistic in many cases. Under such random walk models, the probability of
OOT events increases with time, even in the absence of monotonic drift. Much opportunity for
continued investigations and research exists in this regard. However, the assumptions stated
herein, when combined with the Weiss-Castrup drift model, provide a rudimentary working
construct with which to glean useful insight into the effect of various adjustment thresholds for
in-tolerance instruments under a variety of systematic and random errors.
Many programmatic factors must be considered when implementing instrument adjustment
policies or thresholds, above and beyond the exclusive consideration of maximizing EOPR.
Instrument adjustment can increase expense to a company or calibration laboratory in that As-Received data must be acquired prior to adjustment, and As-Left data must be taken after the
adjustment. The model presented here strives to encourage additional investigation while
providing program managers and metrology professionals with a tool to assist in the
establishment of instrument adjustment policies and to guide possible decision processes. Astute
policy makers will likely use a variety of tools, models, assumptions, and empirical data,
balancing many options and objectives, to achieve the most prudent adjustment policy for a
particular organization.

10. Acknowledgements and Disclosures


The author wishes to thank Tom Waltrich, Nancy Mescher, and Jerry Phillips of Baxter
Healthcare Corporation for many fruitful discussions of the material presented here relating to
instrument drift. Much gratitude is extended to Dr. Howard Castrup of Integrated Sciences
Group and Jonathan Harben of Keysight Technologies for insightful comments, critiques, and
suggestions during review of this paper. Material presented here was adopted from, or inspired
by, NCSLI RP-1 with regard to renewal/adjustment policies. The writing efforts of the NCSL
International Calibration Interval Committee are gratefully acknowledged and appreciated. No
endorsement of the work presented here, by the aforementioned parties, is implied. At the time of
publication, the results and conclusions of modeling presented in this paper are considered
preliminary and have not had the benefit of vetting by other independent sources. Such review is
highly desired and encouraged.

11. Bibliography
1. ANSI/NCSL Z540.3:2006. Requirements for the Calibration of Measuring and Test
Equipment. American National Standards Institute / NCSL International, 2006.
http://www.ncsli.org/I/i/p/z3/c/a/p/NCSL_International_Z540.3_Standard.aspx?hkey=7de8317116ff-416c-9182-94c8447fb300
2. NCSL Z540.3:2006 Handbook, Handbook for the Application of ANSI/NCSL Z540.3-2006
Requirements for the Calibration of Measuring and Test Equipment. American National
Standards Institute / NCSL International. 2006.
http://www.ncsli.org/I/i/p/zHB/c/a/p/Zhb1.aspx?hkey=572363f0-59e9-4817-8b65-ae6ba5d8ff24
3. ISO/IEC 17025:2005(E). General Requirements for the Competence of Testing and
Calibration Laboratories. International Organization for Standardization / International
Electrotechnical Commission. 2005.
http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39883
4. JCGM 200:2012 (ISO/IEC Guide 99-12:2007). International Vocabulary of Metrology
Basic and General Concepts and Associated Terms (VIM). Joint Committee for Guides in
Metrology - Working Group 2, 3rd Edition. 2008.
http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2012.pdf
5. U.S. Department of Health and Human Services, Food and Drug Administration, Form 483
Observation #11 Control of Inspection, Measuring, and Test Equipment, Commander S.
Creighton, Consumer Safety Officer. Issued to St. Jude Medical IESD, Sylmar CA. October 17,
2012.
http://www.fda.gov/downloads/aboutfda/centersoffices/officeofglobalregulatoryoperationsandpol
icy/ora/oraelectronicreadingroom/ucm328488.pdf

Response (November 7, 2012) Observation #11


http://www.fda.gov/downloads/AboutFDA/CentersOffices/OfficeofGlobalRegulatoryOpe
rationsandPolicy/ORA/ORAElectronicReadingRoom/UCM334747.pdf

Response (March 13, 2013) Observation #11a


http://www.fda.gov/downloads/AboutFDA/CentersOffices/OfficeofGlobalRegulatoryOpe
rationsandPolicy/ORA/ORAElectronicReadingRoom/UCM346876.pdf

6. J. Bucher, Measure for Measure - Out of Sync. American Society for Quality (ASQ)
Measurement Quality Division, The Standard, Vol. 27 No. 2, PDF pp 21-22. June 2013.
http://rube.asq.org/measure/2013/05/the-standard-june-2013.pdf
6a. The preceding paper was also published in ASQ Quality Progress, pp 52-53. March 2010.
http://asq.org/quality-progress/2010/03/measure-for-measure/out-of-sync.html
7. J. Bucher, Debunking The Two Great Myths About Calibration: Traceability to NIST: If You
Cannot Adjust, You Cannot Calibrate. Proceedings of the NCSL International Workshop and
Symposium, National Harbor MD. Aug 2011. https://www.ncsli.org/c/f/p11/48.299.pdf
8. J. Bucher, Where Does It Say That? Clearing Up the FDA's Calibration Requirements.
American Society for Quality, Measurement Quality Division, The Standard, Vol. 27 No. 2, PDF
pp 31-32. June 2013. http://rube.asq.org/measure/2013/05/the-standard-june-2013.pdf

8a. The preceding paper was also published in ASQ Quality Progress, pp 50-51. November 2010.
http://asq.org/quality-progress/2010/11/measure-for-measure/where-does-it-say-that.html
9. NCSL RP-1:2010, Recommended Practice: Establishment and Adjustment of Calibration
Intervals, NCSL International, Boulder CO. 2010.
http://www.ncsli.org/I/i/Store/rp/iMIS/Store/rp.aspx?hkey=bf3e3957-f502-484d-9842fa5ef6325073
10. B. Weiss, Does Calibration Adjustment Optimize Measurement Integrity?. Proceedings of
the National Conference of Standards Laboratories Workshop and Symposium, Albuquerque
NM. August 1991. http://legacy.library.ucsf.edu/tid/jlw43b00/pdf
11. D. Shah, Deming Funnel Experiment and Calibration Over Adjustment: New Innovation?
American Society for Quality, ASQ World Conference on Quality and Improvement
Proceedings, Vol. 61, Orlando FL. April 2007. http://asq.org/qic/display-item/?item=21074
12. S. Prevette, Dr. Deming's Funnel Experiment, Symphony Technologies Pvt Ltd, Rule 2
Example Periodic Calibrations, Pune India.
www.symphonytech.com/articles/pdfs/spfunnel.pdf
13. D. Abell, Do You Really Need a 17025 Accredited Calibration?. Proceedings of the NCSL
International Workshop and Symposium, Tampa FL. August 2003.
14. G Payne, Measure for Measure: Calibration: What Is It? ASQ Quality Progress, American
Society for Quality. pp 72-76. May 2005.
http://asq.org/quality-progress/2005/05/measure-for-measure/calibration-what-is-it.html
15. D. Jackson, Calibration Intervals New Models and Techniques, Naval Surface Warfare
Center Corona Division, Proceedings of the Measurement Science Conference, Anaheim CA.
January 2002.
16. C. Hamilton, Y. Tang, Evaluating the Uncertainty of Josephson Voltage Standards.
Metrologia Vol. 36 No. 1, pp 53-58. February 1999.
https://www.researchgate.net/profile/Y_Tang2/publication/231103850_Evaluating_the_uncertainty_of_Jo
sephson_voltage_standards/links/54abe6cf0cf25c4c472fb877.pdf

17. C. Hamilton, L. Tarr. Projecting Zener DC Reference Performance Between Calibrations.


IEEE Transactions on Instrumentation and Measurement. Vol. 52 No. 2, pp 454-456. April 2003.
http://vmetrix.home.comcast.net/~vmetrix/ZenerP.pdf
18. NASA-HDBK-8739.19-2. Measuring and Test Equipment Specifications, NASA
Measurement Quality Assurance Handbook ANNEX 2. National Aeronautics and Space
Administration. July 2010. https://standards.nasa.gov/documents/viewdoc/3315777/3315777
19. Fluke 8508A Digital Multimeter, Extended Accuracy Specifications, Publication 1887212 DENG-N Rev C, DS263. Fluke Corporation. October 2002.
http://media.fluke.com/documents/8508A_Extended_Specs_Rev_C.pdf
20. D. Deaver. Having Confidence in Specifications. Proceedings of the Measurement Science
Conference. Newport Beach CA. 2004. http://assets.fluke.com/appnotes/calibration/msc04.pdf
21. M. Dobbert. Setting and Using Specifications An Overview. Proceedings of the 2010
NCSL International Workshop and Symposium. Providence RI. July 2010.

21a. A version of the preceding paper was also published in NCSLI Measure The Journal of
Measurement Science. Vol. 5 No. 3, pp 68-73. September 2010.
http://www.keysight.com/upload/cmc_upload/All/Setting_Using_Specifications.pdf
22. P. Reese, J. Harben. Implementing Strategies for Risk Mitigation in the Modern Calibration
Laboratory, Proceedings of the NCSL International Workshop and Symposium, National Harbor
MD. August 2011.
https://www.researchgate.net/profile/Paul_Reese2/publication/258311599_Implementing_Strategies_for_
Risk_Mitigation_In_the_Modern_Calibration_Laboratory/file/e0b49527c29c941b2e.pdf

23. P. Reese, J. Harben. Risk Mitigation Strategies for Compliance Testing, Measure The
Journal of Measurement Science, NCSL International Vol.7, No.1 pp 38-49. March 2012
https://www.researchgate.net/profile/Paul_Reese2/publication/258311819_Risk_Mitigation_Strategies_fo
r_Compliance_Testing/file/e0b49527c24ec50664.pdf

24. NASA KNPR 8730.1 (Rev. Basic-1), Kennedy NASA Procedural Requirements, Section 3.3
Calibration Intervals, pp 11. National Aeronautics and Space Administration. March 2003.
25. Navy OPNAV 3960.16A, Navy Test, Measurement, and Diagnostic Equipment (TMDE),
Automatic Test Systems (ATS), and Metrology and Calibration (METCAL), Section 6 Policy,
U.S. Navy, paragraph (o), pp 6. August 2005.
http://doni.daps.dla.mil/Directives/03000%20Naval%20Operations%20and%20Readiness/03900%20Research,%20Development,%20Test%20and%20Evaluation%20Services/3960.16A.pdf

26. Albright J., Thesis: Reliability Enhancement of the Navy Metrology and Calibration
Program, Naval Postgraduate School, Monterey CA. December 1997.
https://calhoun.nps.edu/bitstream/handle/10945/8906/reliabilityenhan00albr.pdf
27. USAF TO 00-20-14, Technical Manual Air Force Metrology and Calibration Program.
Section 3.4 Calibration Intervals, pp 3-8. Secretary of the United States Air Force, September
2011. www.wpafb.af.mil/shared/media/document/AFD-120724-063.pdf
28. GAO LCD-77-427, B-160682, A Central Manager is Needed to Coordinate the Military
Diagnostic and Calibration Program. Appendix I Different Criteria Used To Establish
Calibration Intervals at Metrology Centers, U.S. General Accounting Office, pp 1-2, May 1977.
http://gao.justia.com/national-aeronautics-and-space-administration/1977/5/a-central-manager-is-neededto-coordinate-the-military-diagnostic-and-calibration-program-lcd-77-427/LCD-77-427-full-report.pdf

29. AR-750-43, Maintenance of Supplies and Equipment Army Test, Measurement, and
Diagnostic Equipment, Chapter 6, Section I Program Objectives and Administration,
Paragraph 6-1a Program Objectives, pp 24. Department of Defense, U.S. Army. Jan 2014.
http://www.apd.army.mil/pdffiles/r750_43.pdf
30. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel
97. Computational Statistics & Data Analysis. Vol. 31 No. 1, pp 27-37. July 1999.
http://users.df.uba.ar/cobelli/LaboratoriosBasicos/excel97.pdf
31. L. Knüsel. On the accuracy of the statistical distributions in Microsoft Excel 97.
Computational Statistics & Data Analysis. Vol. 26 No. 3, pp 375-377. January 1998.
http://www.sciencedirect.com/science/article/pii/S0167947397817562

32. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel


2000 and Excel XP. Computational Statistics & Data Analysis. Vol.40 No. 4, pp 713-721.
October 2002.
https://www.researchgate.net/publication/222672996_On_the_accuracy_of_statistical_procedures_in_Mi
crosoft_Excel_2000_and_Excel_XP/links/00b4951c314aac4702000000.pdf

33. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel


2003. Computational Statistics & Data Analysis. Vol.49. No. 4, pp 1244-1252. June 2005.
http://www.pucrs.br/famat/viali/tic_literatura/artigos/planilhas/msexcel.pdf
34. L. Knüsel. On the accuracy of statistical distributions in Microsoft Excel 2003.
Computational Statistics & Data Analysis, Vol. 48, No. 3, pp 445-449. March 2005.
http://www.sciencedirect.com/science/article/pii/S0167947304000337
35. B. McCullough, D. Heiser. On the Accuracy of Statistical Procedures in Microsoft Excel
2007. Computational Statistics & Data Analysis. Vol.52. No. 10, pp 4570-4578. June 2008.
http://users.df.uba.ar/mricci/F1ByG2013/excel2007.pdf
36. A. Yalta. The Accuracy of Statistical Distributions in Microsoft Excel 2007. Computational
Statistics & Data Analysis. Vol. 52 No. 10, pp 4579 4586. June 2008.
http://www.sciencedirect.com/science/article/pii/S0167947308001618
37. B. McCullough. Microsoft Excel's 'Not The Wichmann-Hill' Random Number Generators.
Computational Statistics and Data Analysis. Vol.52. No. 10, pp 4587-4593. June 2008.
http://www.sciencedirect.com/science/article/pii/S016794730800162X
38. G. Melard. On the Accuracy of Statistical Procedures in Microsoft Excel 2010.
Computational Statistics. Vol.29 No. 5, pp 1095-1128. October 2014.
http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf
39. L. Knüsel. On the Accuracy of Statistical Distributions in Microsoft Excel 2010. Dept. of
Stats. - University of Munich, Germany. http://www.csdassn.org/software_reports/excel2011.pdf
40. M. Foley. About That 1 Billion Microsoft Office Figure. All About Microsoft. ZDNet
June 2010. http://www.zdnet.com/article/about-that-1-billion-microsoft-office-figure/
41. NIST Statistical Reference Database (StRD). National Institute of Standards and
Technology. Information Technology Laboratory - Statistical Engineering Div. November 2003.
http://www.itl.nist.gov/div898/strd/
42. P. L'Ecuyer, R. Simard, TestU01: A C Library for Empirical Testing of Random Number
Generators. ACM Transactions on Mathematical Software. Vol. 33 No. 4, article 22, pp 22:1
22:40. August 2007. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/testu01.pdf
http://simul.iro.umontreal.ca/testu01/tu01.html (current version 1.2.3, 18 August 2009).
43. G. Marsaglia. The Marsaglia Random Number CDROM Including the Diehard Battery of
Tests of Randomness. Florida State University - Department of Statistics and Supercomputer
Computations Research Institute. 1995. http://www.stat.fsu.edu/pub/diehard/
44. M. Matsumoto, T. Nishimura. Mersenne Twister: A 623-Dimensionally Equidistributed
Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer
Simulation. Vol.8 No. 1, pp 3-30. January 1998.
http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/ARTICLES/mt.pdf

45. B. McCullough. A Review of TestU01. Journal of Applied Econometrics. Vol. 21 No. 5, pp


677-682. July/August 2006. http://www.pages.drexel.edu/~bdm25/testu01.pdf
46. B. Wichmann, I. Hill. Algorithm AS 183: An Efficient and Portable Pseudo-Random
Number Generator. Applied Statistics. Vol. 31 No. 2, pp 188-190. June 1982.
https://www.researchgate.net/publication/243774153_Algorithm_AS_183_An_efficient_and_portable_ps
eudo-random_number_generator

47. B. Wichmann, I. Hill. Generating Good Pseudo-Random Numbers. Computational Statistics


& Data Analysis. Vol.51 No. 3, pp 1614-1622. December 2006.
https://www.researchgate.net/publication/220055967_Generating_good_pseudo-random_numbers.

47a. A long version of the preceding paper (w/software and results from BigCrush TestU01) is
available from NPL with margin notes and additional appendices regarding implementation of
the enhanced 4-cycle Wichmann-Hill PRNG.
http://www.npl.co.uk/science-technology/mathematics-modelling-and-simulation/mathematics-andmodelling-for-metrology/mmm-software-downloads

48. T. Symul, S. Assad, P. Lam. Real Time Demonstration of High Bitrate Quantum Random
Number Generation with Coherent Laser Light. Applied Physics Letters. Vol. 98 No. 23. June
2011. http://arxiv.org/pdf/1107.4438.pdf http://photonics.anu.edu.au/qoptics/Research/qrng.php
49. A. Yee, S. Kondo. 12.1 Trillion Digits of Pi, And We're Out of Disk Space... December 2013.
http://www.numberworld.org/misc_runs/pi-12t/
50. F. Panneton, P. L'Ecuyer, M. Matsumoto. Improved Long-Period Generators Based on
Linear Recurrences Modulo 2. ACM Transactions on Mathematical Software. Vol. 32 No. 1, pp
1-16. March 2006. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf
51. JCGM 101:2008. Evaluation of Measurement Data - Supplement 1 to the Guide to the
Expression of Uncertainty in Measurement - Propagation of Distributions Using a Monte Carlo
Method. Joint Committee for Guides in Metrology. Working-Group 1. First Edition, 2008.
http://www.bipm.org/utils/common/documents/jcgm/JCGM_101_2008_E.pdf
52. A. Steele, R. Douglas. Simplifications from Simulations: Monte Carlo Methods for
Uncertainties. NCSLI Measure The Journal of Measurement Science. Vol. 1 No. 2, pp 56-68.
June 2006. http://www.ncsli.org/I/mj/dfiles/NCSLI_Measure_2006_June.pdf
52a. A version of the preceding paper was also published in the 2005 Proceedings of the NCSL
International Workshop & Symposium, Washington D.C. August 2005.
53. P. Reese. Personal communications with P. L'Ecuyer & R. Simard via email. April 2015.
54. H. Castrup. Calibration Requirements Analysis System. Proceedings of the 1989 NCSL
Workshop and Symposium, Denver CO. July 1989.
http://www.isgmax.com/articles_papers/ncsl89.pdf
55. NCSLI LM-5. Laboratory Management Publication: Benchmark Survey - 2007. Sponsored
by Boeing Company. NCSLI International 182 Benchmarking Programs Committee. Boulder
CO. 2007. http://www.ncsli.org
56. AIAG MSA-4. Measurement Systems Analysis Reference Manual. Automotive Industry
Action Group (AIAG) MSA Work Group. Chrysler Group LLC, Ford Motor Company, General
Motors Corporation. ISBN 978-1-60-534211-5. Fourth Edition. June 2010.
http://www.aiag.org/source/Orders/prodDetail.cfm?productDetail=MSA-4
57. ISO/TS 16949:2009. Quality management systems -- Particular Requirements for the
Application of ISO 9001:2008 for Automotive Production and Relevant Service Part
Organizations. International Organization for Standardization. 2009.
http://www.ts16949.com/a55aeb/ts16949.nsf/layoutB/Home+Page?OpenDocument
58. T. Nolan, P. Provost. Understanding Variation. ASQ Quality Progress. American Society for
Quality. Vol. 23 No. 5. May 1990. http://www.apiweb.org/UnderstandingVariation.pdf
59. J. Bucher. The Quality Calibration Handbook: Developing and Managing a Calibration
Program. American Society for Quality, Quality Press. ISBN-13: 978-0-87389-704-1. 2007.
http://asq.org/quality-press/display-item/?item=H1293
60. J. Bucher. The Metrology Handbook. American Society for Quality, Measurement Quality
Division. ASQ Quality Press. ISBN 0-87389-620-3. 2004.
http://asq.org/quality-press/display-item/?item=H1428

APPENDIX A
Monte Carlo Modeling of EOPR
For Various Adjustment Thresholds Under Drift and Random Variation

Figure A1. Example of first ten iterations of the Monte Carlo simulation
Conditions:
Drift set to 10 % of Specification (Random: σ = 50.038 % of Spec).
Adjustment Threshold set to 80 % of Spec.
To facilitate a step-by-step understanding of the model, the first 10 iterations are shown in Figure
A1, the first five of which are described in detail below.
Iteration #1.
Initial Conditions are set to 0 % observed error and 0 % attribute bias. Since the observed error
(0 %) is less than 100 % of spec, the UUT is declared In-Tolerance. Since the observed error (0
%) is also less than the adjustment threshold of 80 %, no discretionary adjustment is made. The
as-left attribute bias is 0 % of specification. The UUT is returned to the customer. During the
course of this calibration interval, a random error associated with a normal distribution manifests
(+81.2 % of Spec). Additionally, a systematic drift error also manifests (+10 % of spec). These
two errors are additive, resulting in a net error of +91.2 % of spec. At the end of the calibration
interval (End-of-Period), the error observed for the UUT is +91.2 % of spec. Note that
only 10 % of this error is due to systematic drift. Thus, the true bias error of the UUT is only
+10 %. The additional +81.2 % error arose from a random error. If a proper adjustment was to be
made at the end of this interval, only a -10 % adjustment should be made to correct only the
systematic attribute bias due to the drift over this calibration interval.

Iteration #2. The UUT is received with an observed error of +91.2 % of spec. It is not known to
the calibration technician how much of the observed +91.2 % error is due to drift (bias) and how
much is due to random behavior. Since this error is less than 100 % of spec, the UUT is declared
In-Tolerance. However, since the observed error of +91.2 % is also greater than the adjustment
threshold of 80 %, a discretionary adjustment is made. The technician makes an adjustment of -91.2 % in an attempt to correct for the observed error. A proper adjustment would have only
been -10 % to compensate only for the cumulative systematic drift over the first interval. But this
is not possible since the only information available at the time of adjustment is the observed error
of +91.2 %. Thus, the adjustment overcompensates by -81.2 % and the UUT is returned to the
customer with an actual attribute bias of -81.2 %. During the course of this calibration interval, a
random error associated with a normal distribution manifests (+7.5 % of spec). Additionally, a
systematic drift error also manifests (+10 % of spec). These two errors are additive, resulting in a
net error of +17.5 % of spec. However, the previous calibration adjustment left the UUT with a -81.2 % systematic bias. Therefore, this pre-existing -81.2 % attribute bias is also added to the
+17.5 % error, resulting in a net observed error at the End-Of-Period of -63.7 %. Note that only a
-71.2 % error is due to systematic effects (-81.2 % from the overcompensated adjustment and
+10 % drift from the second interval). If a proper adjustment was to be made at the end of this
interval, only a +71.2 % adjustment should be made to correct exclusively for the systematic
attribute bias due to the overcompensated adjustment and the drift during this second interval.
Iteration #3. The UUT is received with an observed error of -63.7 % of spec. It is not known to
the calibration technician how much of the observed -63.7 % error is due to systematic effects
(bias) and how much is due to random behavior. Since the observed error is less than 100 % of
spec, the UUT is declared In-Tolerance. Moreover, the observed error of -63.7 % is less than the
adjustment threshold of 80 %; therefore, no discretionary adjustment is made. The UUT is
returned to the customer with an actual attribute bias of -71.2 %. During the course of this
calibration interval, a random error associated with a normal distribution manifests (+44.8 % of
spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are
additive, resulting in a net error of +54.8 % of spec. However, the previous calibration left the
UUT with a -71.2 % systematic bias. Therefore, this pre-existing -71.2 % attribute bias is added
to the +54.8 % error, resulting in a net observed error at the End-Of-Period of -16.5 %. Note that
only a -61.2 % error is due to systematic effects (-71.2 % and an additional +10 % drift from this
third interval). If a proper adjustment was to be made at the end of this interval, only a +61.2 %
adjustment should be made to correct exclusively for the systematic attribute bias.
Iteration #4. The UUT is received with an observed error of -16.5 % of spec. It is not known to
the calibration technician how much of the observed -16.5 % error is due to systematic effects
(bias) and how much is due to random behavior. Since the observed error is less than 100 % of
spec, the UUT is declared In-Tolerance. Moreover, since the observed error of -16.5 % is less
than the adjustment threshold of 80 %, a discretionary adjustment is not performed. The UUT is
returned to the customer with an actual attribute bias of -61.2 %. During the course of this
calibration interval, a random error associated with a normal distribution manifests (-39.7 % of
spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are
additive, resulting in a net error of -29.7 % of spec. However, the previous calibration left the
UUT with a -61.2 % systematic bias. Therefore, this pre-existing -61.2 % attribute bias is added
to the -29.7 % error, resulting in a net observed error at the End-Of-Period of -90.9 %. Note that
the attribute bias is only -51.2 % due to systematic effects (-61.2 % and an additional +10 %
drift from this fourth interval). If a proper adjustment was to be made at the end of this interval,
only a +51.2 % adjustment should be made to correct exclusively for the systematic attribute bias.
Iteration #5. The UUT is received with an observed error of -90.9 % of spec. It is not known to
the calibration technician how much of the observed -90.9 % error is due to systematic effects
(bias) and how much is due to random behavior. Since the observed error is less than 100 % of
spec, the UUT is declared In-Tolerance. However, since the magnitude of the observed error (-90.9 %) is greater
than the adjustment threshold of 80 %, a discretionary adjustment is made. The technician
makes an adjustment of +90.9 % in an attempt to correct for the observed error. A proper
adjustment would have only been +51.2 % to compensate only for the systematic attribute bias.
But this is not possible since the only information available at the time of adjustment is the
observed error of -90.9 %. Thus, the adjustment overcompensates by +39.7 % and the UUT is
returned to the customer with an actual attribute bias of +39.7 %. During the course of this
calibration interval, a random error associated with a normal distribution manifests (+115.8 % of
spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are
additive, resulting in a net error of +125.8 % of spec. However, the previous calibration left the
UUT with a +39.7 % systematic bias from the previous adjustment. Therefore, this pre-existing
+39.7 % attribute bias is added to the +125.8 % error, resulting in a net observed error at the
End-Of-Period of +165.5 %. Note that the attribute bias is only +49.7 % due to systematic
effects (+39.7 % bias from the previous adjustment, and another +10 % drift from this fifth
interval). If a proper adjustment was to be made at the end of this interval, only a -49.7 %
adjustment should be made to correct exclusively for the systematic attribute bias due to the
overcompensated adjustment and the drift over this fifth calibration interval. The UUT will
arrive in the calibration lab at the beginning of the 6th iteration with an actual attribute bias of
+49.7 %, but with an observed error of +165.5 %.
The cyclic process described above is repeated for 100 000 iterations and the EOPR is computed.
The 100 000 iteration cycle is repeated 9 more times and the average of the ten EOPR values is
taken as the final estimate of one EOPR value for use in the 101 x 101 matrix. This entire
process is then repeated 10 201 times (1.02 x 10^10 total iterations) to complete the matrix, shown
in Figures 10A and 10B.

APPENDIX B
On the Use of Microsoft Excel for Monte Carlo Methods
The use of Microsoft Excel as a serious scientific platform for statistical analysis has many
detractors as well as a long history of critique by the statistical community [30-39]. However, it
remains one of the most widely used of all software tools in current use. As of 2010, an
estimated 750 million copies have been installed as part of the MS Office suite [40].
Excel may arguably be described by the principle of Maslow's Hammer, often stated as, "If all
you have is a hammer, everything looks like a nail." Excel is undoubtedly utilized in many
situations where a more appropriate or efficient tool exists. Yet, this observation alone does not
preclude Excel's utility in a wide array of diverse applications. The flexibility and ubiquitous
nature of Excel may be more analogous to a Swiss Army Knife than a hammer. It may not be the
best tool for any job, but it can be an acceptable tool for many jobs, especially when
precautionary measures are taken to ensure acceptable performance. Given the Visual Basic for
Applications (VBA) programming environment in Excel, it can be a powerful option.
Like any software, it should be confirmed via objective evidence that Excel will provide
trustworthy, accurate results with an acceptable degree of confidence. This gives rise to
validation requirements in some critical applications to ensure that computations are being
performed correctly with an acceptable degree of accuracy. In this regard, Excel is no different
than any other software package. Its built-in functions, user-defined functions, logic, equations,
etc. should be validated to the extent necessary to satisfy applicable requirements. NIST provides
a Statistical Reference Database to aid in such evaluations [41].
Excel 2010
Many, but not all, of the historical criticisms regarding Excel's suitability for statistical analysis
have been addressed and largely rectified with Excel 2010 [38-39]. Melard [38] has evaluated
Excel 2010's Pseudo Random Number Generator (PRNG), implemented as the RAND()
function. The RAND() function is designed to return values uniformly distributed over the range
of [0,1). Melard has shown the RAND() function in Excel 2010 to pass most modern statistical
tests for randomness, specifically a modified version of the Crush test suite in the TestU01
library by L'Ecuyer and Simard [42]. TestU01 has essentially superseded older series of RNG
tests, e.g. Diehard tests of Marsaglia [43] and offers a challenging battery of tests for any PRNG.
Had Melard chosen to invoke the most rigorous test suite of the TestU01 library for testing the
RAND() function in Excel 2010, called BigCrush, it would have required a very large test file of
random numbers from Excel, roughly 3 TB in size; thus, BigCrush was not performed. As it was,
the smaller 412 GB Crush test-file (~2^35 numbers) took two weeks to generate and 36 hours of
CPU time to run the actual Crush tests. He concludes, "All tests are passed except Periods in
Strings with r = 15 and s = 15 for which the p-value is 8 x 10^-7." Melard attributes these
anomalies to his specific approach in generating the test file to manage its size. Additionally,
Melard references a semi-official indication that the Mersenne Twister algorithm known as
MT19937 has been implemented for the Excel 2010 RAND() function and is assumed to be
responsible for the improved performance, compared with previous versions of Excel.

The Mersenne Twister (MT19937) is a relatively modern pseudo random number generator published in 1998 by Matsumoto and Nishimura [44]. It is now available in many mathematical programming packages, e.g. MATLAB, Maple, R, GAUSS, SAS, SPSS, Ruby, Python, Julia, Visual C++, etc., and presumably in Excel 2010. However, MT19937 has been shown to fail two tests for linear complexity (r = 1 and r = 29, where p < 10⁻¹⁵) in the extensive BigCrush suite of tests using version 1.0 of TestU01 [38, 42]. Conflictingly, other authors report that MT19937 passes all tests in BigCrush using version 1.1 [38, 45] and version 0.6.0⁴ [47] of TestU01. However, L'Ecuyer and Simard (the authors of TestU01) have confirmed that MT19937 does indeed fail recent TestU01 tests for linear complexity and that it is well understood why this occurs [53]. L'Ecuyer et al. have published that MT19937 "...successfully passed all the statistical tests included in BigCrush of TestU01, except those that look for linear dependencies in a long sequence of bits, such as the linear complexity tests... This is in fact a limitation of all F2-linear generators, including the Mersenne Twister... Because of their linear nature, the sequences produced by these generators just cannot have the linear complexity of a truly random sequence. This is definitely unacceptable in cryptology but is quite acceptable for the vast majority of simulation applications" [50].
Good PRNGs are essential in order for Monte Carlo methods to yield accurate results. The same is true for the probability distributions used in the simulations, e.g. the normal probability density function and its inverse. In the aforementioned paper by Melard [38], the accuracy of the Excel 2010 NORM.INV function, along with many other probability distributions, was also tested with positive results. Improvements over Excel 2003 and 2007 were noted, and its accuracy was on par with other statistical applications. Melard states, "On the basis of these results, Microsoft Excel 2010 appears as good as OpenOffice.org Calc 3.3." He continues, "To conclude, most of the problems of Excel raised by Yalta (2008) were corrected in the 2010 version." Regarding the NORM.S.DIST and NORM.DIST functions of Excel 2010, Knüsel [39] additionally notes, "No errors were found with these two functions"⁵ and states, "Most of the errors in Microsoft Excel 97 and Excel 2003 pointed out in my previous papers have been eliminated in Excel 2010."
The preceding evidence suggests that two critical aspects of Monte Carlo simulations are satisfied by Excel 2010, i.e. accurate statistical probability density functions and a robust random number generator. As such, Excel 2010 may be a viable tool for investigations such as those presented in this paper. In addition, looping constructs in VBA can be used to readily process Monte Carlo simulations in Excel. When doing so, it is sometimes helpful to change the formula calculation option for workbooks from its default setting of automatic to manual, and then to embed the VBA code that performs the recalculation (e.g. Application.Calculate) into the loop itself. Also, turning off screen updating (Application.ScreenUpdating = False) can greatly reduce the time required to perform long sequences of Monte Carlo iterations in Excel.
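By way of illustration, the following VBA sketch draws normally distributed errors with NORM.INV(RAND(), ...), writes each draw to a model input cell, and recalculates manually inside the loop. The sheet name, cell addresses, trial count, and distribution parameters are arbitrary placeholders; this is a minimal sketch of the looping technique described above, not the simulation code used to generate the results in this paper.

    ' Minimal Monte Carlo loop sketch (Excel 2010 or later assumed).
    ' Sheet name, cell addresses, and parameters are illustrative only.
    Sub RunMonteCarloSketch()
        Const nTrials As Long = 10000
        Dim i As Long
        Dim drawnError As Double
        Dim results() As Double
        ReDim results(1 To nTrials)

        Application.ScreenUpdating = False              ' suppress screen redraws
        Application.Calculation = xlCalculationManual   ' defer automatic recalculation

        For i = 1 To nTrials
            ' Draw one normally distributed error (mean 0, standard deviation 1)
            drawnError = Application.Evaluate("NORM.INV(RAND(),0,1)")
            ' Place the draw in a model input cell and recalculate explicitly
            Worksheets("Model").Range("B2").Value = drawnError
            Application.Calculate
            ' Capture the model output for later analysis
            results(i) = Worksheets("Model").Range("B10").Value
        Next i

        Application.Calculation = xlCalculationAutomatic
        Application.ScreenUpdating = True
    End Sub

Writing each draw to the worksheet and reading the result back keeps the example transparent; where speed matters, keeping the entire simulation within VBA arrays avoids most of the worksheet overhead.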
4 Wichmann & Hill in 2006 [47] report MT19937 passes all BigCrush tests in Version 6.0 of TestU01 (dated Jan 14, 2005). However, Simard has confirmed this must have actually been version 0.6.0 (pre-official-release), where the Linear Complexity tests used all of the first 30 bits; MT19937 would indeed pass this [53]. Later official versions of TestU01 have two linear complexity tests that use the 1st and 30th bit of each random number, which MT19937 fails [42, 50, 53] using version 1.0 and later. It is unknown why passing results for all BigCrush tests in Ver 1.1 of TestU01 were reported by McCullough [45]. The current version of TestU01 is 1.2.3, dated 18 August 2009 [42].
5 Knüsel does report [negligible] errors in Excel 2010's NORM.S.INV and NORM.INV functions at extremely small probabilities (p-values) < 2.2251 × 10⁻³⁰⁸.
Previous versions of Excel
Prior to Excel 2010, the RAND() function in Excel was not generally considered suitable for such methods [30-39], most notably prior to Excel 2003. The release of Excel 2003 saw an improved PRNG, with RAND() reportedly implementing the popular 3-cycle Wichmann-Hill PRNG from 1982, also known as algorithm AS 183 [46]. However, a bug in the original Excel 2003 RAND() function caused negative numbers to be occasionally generated, and Microsoft soon issued a patch to correct this error [33, 37]. It should be noted that even the patched 2003 version of RAND(), as well as the version implemented in Excel 2007, was tested and found not to be a robust implementation of the AS 183 Wichmann-Hill PRNG algorithm from 1982 [37]. Issues with the accuracy of some probability distributions prior to 2010 have also been reported [30-36]. Thus, versions of Excel prior to 2010 should be carefully evaluated when used for Monte Carlo simulations or other statistically intensive computations, especially in critical applications where the risk or consequences of inaccurate results are significant.
Pseudo Random Number Generators (PRNGs)
With improvements in computing power and increased abilities to test PRNGs, even a proper implementation of the 3-cycle Wichmann-Hill AS 183 algorithm from 1982 [46] can present limitations in modern applications such as intensive Monte Carlo modeling. A new breed of PRNGs has evolved that addresses these issues, such as the aforementioned Mersenne Twister. However, even the best PRNGs are deterministic. That is, given the same set of initial conditions, called the "seed," a given PRNG will produce exactly the same output stream of numbers each time it is run. If the algorithm is known, and the seed is known, the sequence of output numbers can be exactly predicted. This is desirable in some instances (such as auditing) but not in others (e.g. encryption). While predictability and randomness may seem mutually exclusive, they are not necessarily so.
For example, the digits of pi (now known to more than 12 trillion digits [49]) have been postulated to be random. Although no formal proof of pi's randomness has been found to date, neither has any regular pattern. The apparently random digits of pi are nevertheless predictable. Such predictability does not necessarily negate the randomness inherent to the sequence of numbers. With PRNGs, the predictability of the output stream can be somewhat (but not totally) inhibited by introducing entropy and/or secrecy into the seed, because it will be unknown exactly where the sequence of numbers started. The intractability of prediction and retrodiction is a requirement for cryptographically secure pseudo random number generators (CSPRNGs). Such generators must preclude prediction of the random numbers even though the algorithm might be known, long samples of output numbers may be available for inspection, and portions of the internal state may have been revealed. In the most extreme cases, truly random numbers may be generated from quantum phenomena [48]. Monte Carlo methods do not require such generators, only that the PRNG used is robust and passes most modern statistical tests (e.g. TestU01).
Although evidence indicates that the RAND() function in Excel 2010 should be adequately robust, it is seeded by an undocumented method, generally believed to be associated with the real-time clock of the host computer. There is no direct user control over the seeding process. RAND() is also a "volatile" function, returning a different random number each time a calculation is performed. This does not necessarily reduce the performance of the RAND() function, but it will not provide for reproducibility, which may be important for independent auditing.
In addition to being deterministic, PRNGs do not have infinite periods. At some point, the stream of output numbers will begin to repeat itself; a replicate pattern will eventually emerge. Short periods are undesirable. The original 3-cycle Wichmann-Hill PRNG (1982) has a period of ~10¹³ [46, 47]. This is relatively small by today's standards, and this older PRNG also fails some BigCrush tests in TestU01. A revised and enhanced 4-cycle Wichmann-Hill PRNG (2006) has a period of ~10³⁶ [47], adequate for virtually any application imaginable. Moreover, the enhanced 4-cycle Wichmann-Hill PRNG has been shown to pass all BigCrush tests in version 0.6.0⁶ of TestU01 [47, 47a]. It has many other desirable properties as well and requires only 26 lines of code to implement in C; it could also be implemented in Excel using VBA. It is regarded as a highly robust PRNG and is referenced in Annex C of JCGM 101:2008⁷ (GUM Supplement 1) for computing measurement uncertainty via Monte Carlo methods [51]. Although the aforementioned MT19937 algorithm fails two BigCrush tests in more recent versions of TestU01, it has an extremely long period of ~2¹⁹⁹³⁷ [44], or ~10⁶⁰⁰¹. To fully appreciate the length of these periods, consideration of the following large numbers is insightful:
The age of the universe is ~4.4 × 10¹⁷ seconds (13.8 billion years).
The fastest supercomputers approach ~3 × 10¹⁶ floating point operations per second.
The number of atoms in the observable universe is ~10⁸⁰.
The number of Planck volumes in the observable universe is ~10¹⁸⁵.

By way of comparison, a generator consumed at 3 × 10¹⁶ numbers per second for the entire age of the universe would deliver only on the order of 10³⁴ values, still short of the ~10³⁶ period of the enhanced 4-cycle Wichmann-Hill generator and utterly negligible against the period of MT19937.
A paper in 2006 by Steele and Douglas⁸ [52, 52a] also provides a wealth of practical information and useful insights for performing Monte Carlo simulations in Excel. While focused on computing measurement uncertainties in Excel, the paper illustrates the usefulness of the VBA programming environment for implementing alternative (custom) pseudo random number generators. The authors provide VBA code for the 1982 Wichmann-Hill PRNG and include step-by-step instructions for writing custom user-defined VBA functions. Advanced users are referred to external Dynamic Link Libraries (DLLs) to facilitate faster execution of compiled code in C (such as PRNGs) within Excel. Also identified in the paper are the limitations of the 1982 Wichmann-Hill generator, along with reference to a PRNG called RANLUX, which offers higher dynamic range as well as other beneficial characteristics (the authors offer to provide VBA code for RANLUX as well as additional helpful resources). It should be noted that the Steele and Douglas paper was written prior to the publication of the enhanced 4-cycle Wichmann-Hill PRNG (2006), prior to the final release of JCGM 101:2008 (GUM Supplement 1), and prior to the advent of Excel 2010. Nevertheless, the paper remains an excellent resource for the researcher wishing to investigate Monte Carlo methods in Excel.
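To give a flavor of the DLL route, the declaration below shows how a compiled C routine could be exposed to VBA. The library name "fastprng.dll" and the function NextUniform are purely hypothetical placeholders, not a library associated with the Steele and Douglas paper or any existing product.

    ' Hypothetical example: "fastprng.dll" and NextUniform are placeholder names
    ' for a user-compiled C library. PtrSafe is required for 64-bit Office (VBA7).
    #If VBA7 Then
        Private Declare PtrSafe Function NextUniform Lib "fastprng.dll" () As Double
    #Else
        Private Declare Function NextUniform Lib "fastprng.dll" () As Double
    #End If

    Public Function FastRand() As Double
        ' Delegate random number generation to the compiled C routine
        FastRand = NextUniform()
    End Function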
6 Reported by Wichmann & Hill [47] as version 6.0 of TestU01. See preceding footnote 3 regarding version 0.6.0.
7 JCGM 101:2008 does not exclusively recommend any particular PRNG over others. It states, "Generators other than those given in this annex can be used. Their statistical quality should be tested before use." [51]
8 This paper was also presented by Dr. Alan Steele on August 9th at the 2005 NCSL International Workshop and Symposium in Washington, D.C., for which it won the Best Paper award in Theoretical Metrology [52a].