
Rejection of Data ©

by

Kevin Lehmann
Department of Chemistry
Princeton University
Lehmann@princeton.edu

© Copyright Kevin Lehmann, 1997. All rights reserved. You are welcome to use this document in your own
classes but commercial use is not allowed without the permission of the author. The author welcomes any
constructive criticisms or other comments from either educators or students.

Goal: To demonstrate to students how a statistically objective method
for removing, or rejecting, data points can improve the accuracy of
estimates made from experimental data. Students will learn how to
apply two commonly used techniques and, more generally, how to use
numerical simulation of data to evaluate different proposed methods of
data analysis.

Prerequisites: This worksheet assumes that the student is already
familiar with the basic concepts of probability theory and error
analysis, including the concepts of a population and a sample drawn
from that population; the mean and standard deviation of both the
population and the sample; the Gaussian distribution function; and
determination of confidence intervals based upon that distribution
function. These topics are covered in most texts that discuss
error analysis, as well as in several worksheets that I have
written and that will be included in this archive (Mean_vs_Median.mcd,
Gaussian_Distribution.mcd, and Linear_Regression.mcd). The
worksheet makes extensive use of Mathcad's statistical
functions, and it would be helpful for the student to have on hand
a copy of the manual to review what each function call does.

Introduction: The purpose of this Mathcad worksheet is to demonstrate
why one will sometimes want to exclude certain data points from the
calculation of population statistics, such as the mean. This is a
subject that is very poorly explained in many introductory books on error
analysis and statistics. The goal of any statistical analysis is to come
up with the best estimate of the "true" value of a quantity that is
measured with random error, along with a realistic estimate of the likely
uncertainty in that measured quantity. The criteria for deciding between
competing methods of analysis should be based on whether the analysis
methods meet the objective criteria of statistics, not some "religious"
commitment to retaining all data.

We will compare the distribution of values calculated for a sample
mean for samples of 25 data points. Each 'value' represents a single
measurement of some physical quantity, such as a voltage or the
concentration of a solution. Each 'sample' of 25 represents the results
of a series of measurements with all experimentally controlled variables
the same, i.e., the measurements are expected to give identical values,
except for the 'noise' that is always present in any measurement. If the
measurements are done by hand, 25 is a lot of times to repeat the same
measurement; 3-5 is more typical. However, for measurements made by a
computer, it is often not difficult to 'signal average' this number of
times or more.

We will generate 1,000 such data sets, picked from a known "Prior
Distribution" having zero mean and unit standard deviation, and compare
the results of our statistical analysis of the generated data to the
statistics of the Prior Distribution. In a real experiment, we never
have precise values of the true population mean and standard deviation;
if we did, there would be no reason to do the experiment! However, the
type of 'numerical simulation' we are going to perform is a useful way
to 'test' the procedures we plan to use to analyze our real data. It is
also very useful in giving experience about what errors to expect given
our assumptions about the nature of measurement errors. From each 'set'
of measurements, we compute an estimate of the 'true' value of the
quantity we are trying to measure. However, because of measurement
error or noise, we will not get the precise 'prior' value we assumed in
generating the 'synthetic data'. Further, each set of data we generate
will give a different estimate. If our statistical method is 'unbiased',
the mean of these estimates will approach our assumed prior value. The
computed standard deviation of the distribution of estimated values is
directly proportional to our uncertainty in the true value (assuming we
did not know it a priori) based upon a single data set. So we will now
start by considering data taken from a known Gaussian distribution.

N := 25000       Number of Gaussian random numbers to generate

µ := 0           Mean of Prior distribution

σ := 1           Standard deviation of Prior distribution

Ns := 25         Number of data points per sample

Nt := N/Ns       Number of different samples, or sets of data points

j := 0 .. Ns - 1    Range variable for data points within a sample

k := 0 .. Nt - 1    Range variable for samples

y<k> := rnorm(Ns, µ, σ)    Mathcad function that generates Ns Gaussian
                           random numbers, with mean µ and standard
                           deviation σ. These values are put into a
                           two-dimensional matrix; each data sample
                           of 25 points is one column of the matrix.

In the empty space to the right, ask Mathcad to show the y
matrix by typing y= . How many columns are in this array?
How many rows?
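For readers without Mathcad, here is a rough Python/NumPy sketch of the
same setup. The variable names mirror the worksheet's; the seed and the
use of np.random.default_rng are my choices, not part of the worksheet:

import numpy as np

rng = np.random.default_rng(0)     # fixed seed so reruns are reproducible

N, mu, sigma, Ns = 25000, 0.0, 1.0, 25
Nt = N // Ns                       # 1000 samples

# Each column is one 'sample' of Ns measurements, matching the worksheet.
y = rng.normal(mu, sigma, size=(Ns, Nt))

print(y.shape)                     # (25, 1000): 25 rows, 1000 columns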

mean(y) = 1.282·10^-3     Here we check the statistics of the
                          total set, y, of random numbers, using
stdev(y) = 1.00108        Mathcad functions.

Why are these values for the mean and standard deviation of the set of numbers we generated
not exactly equal to the values of µ and σ that we used to generate y?

Now let's generate statistics for each one of our sets of Ns data
points, as we would do if they were real measurements from the lab.

avg_k := mean(y<k>)    The calculated mean of the k'th sample of 25
                       points

s_k := sqrt(Ns/(Ns - 1))·stdev(y<k>)    s_k is the standard deviation (SD) of
                                        the k'th sample. The square-root factor
                                        is needed because the Mathcad function
                                        stdev divides by Ns instead of Ns - 1,
                                        and we need the latter for finite
                                        samples of data.

rms(x) := sqrt(mean(x·x))    Here we define the function rms. This
                             function returns the root mean square
                             of a set of numbers contained in a vector x.

Here we apply the rms function to the array s.

rms(s) = 0.999    Notice that the RMS value of the set of
                  sample SDs almost equals the standard deviation
                  of the Prior Distribution, as it should.

How many s values are in the array s?

What is the meaning of each s value?

Why do we use an rms average instead of a traditional mean?
(Hint: calculate the mean of s and compare it to what you would
have expected.)

From each set of data points, we have computed a sample mean value,
avg_k, and a sample standard deviation, s_k. Next we compute the
statistics (mean and standard deviation of the 1,000 averages) of the
set of avg_k values we have determined.

mean(avg) = 1.282·10^-3    This is exactly the same as the mean
                           of y calculated above.

stdev(avg) = 0.208    The computed standard deviation of the sample
                      means is smaller than that of the Prior
                      Distribution by close to the expected factor
                      of 1/sqrt(Ns) = 0.2.
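These per-sample statistics are quick to reproduce outside Mathcad; a
self-contained NumPy sketch (the exact numbers will differ from the
worksheet's, since the random draws differ):

import numpy as np

rng = np.random.default_rng(0)
Ns, Nt = 25, 1000
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

avg = y.mean(axis=0)               # one mean per column/sample
s = y.std(axis=0, ddof=1)          # ddof=1 gives the Ns - 1 denominator

rms = lambda x: np.sqrt(np.mean(x * x))

print(rms(s))                      # close to sigma = 1
print(avg.mean())                  # close to mu = 0
print(avg.std())                   # close to sigma/sqrt(Ns) = 0.2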

Let us now look at plots of the distributions of both the data
points and the sample means.

i := 0 .. 100         Set the range variable to define the bins for the histogram plot

x_i := -3 + 0.06·i    Vector to define bins for histograms

hdata := hist(x, y)/(N·0.06)       hist(x, y) generates a vector whose i'th
                                   value counts the number of y values that
                                   occur between x_i and x_i+1. The factor
hmean := hist(x, avg)/(Nt·0.06)    dividing hist(x, y) normalizes the
                                   histogram to unit area.

Ask Mathcad to display hdata and hmean. Compare these
vectors to the ordinate of the Distribution of Data figure
below. What is a probability density?

i := 0 .. 99    Need to redefine the i range variable, as there is one
                less histogram value than there are elements of x.
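This normalization to a probability density is the same one NumPy's
histogram performs with density=True; a minimal sketch under that
assumption:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 25000)

# density=True divides the counts by (number of points x bin width),
# exactly the 1/(N*0.06) factor used above.
counts, edges = np.histogram(data, bins=np.linspace(-3, 3, 101), density=True)
print(counts.sum() * 0.06)         # ~1: the histogram integrates to unit area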

[Figure: "Distribution of Data and Sample Means" - probability density
versus (value - mean)/standard deviation, from -3 to 3, comparing the
histogram of the data with the much narrower histogram of the sample
means.]

Let us now apply Chauvenet's criterion for
rejection of data.

We now consider criteria for removing, or 'rejecting', data points
that appear to be 'erroneous'. One of the most popular methods (at
least in introductory textbooks on error analysis) is Chauvenet's
criterion. We will now apply this criterion to our data samples and
see how this changes the distribution of average values. To use
Chauvenet's criterion, we compute the mean and standard deviation of
each set of data. We then assume that the errors are normally
distributed, which is what we designed our original data to be. We
will 'reject' a data point if the probability of finding even one
point in a sample of Ns points that far from the sample mean is less
than 1/(2·Ns). Thus, if our sample contains 25 data points, we will
reject a data point if it is more than about 2.33 standard deviations
from the mean, because the theory of Gaussian-distributed errors shows
that the probability of finding a point 2.33 or more standard
deviations from the mean is 2%. After we have 'rejected' data points
(if any) using this criterion, we compute an estimate for the sample
mean and standard deviation from the points that are left.
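As a quick check of the 2.33σ figure, the two-sided Gaussian cutoff for
a rejection probability of 1/(2·Ns) can be computed with SciPy's inverse
normal CDF (norm.ppf, SciPy's counterpart of Mathcad's qnorm); a sketch:

from scipy.stats import norm

Ns = 25
reject_prob = 1.0 / (2 * Ns)               # 0.02 for Ns = 25

# Two-sided cutoff: put half the rejection probability in each tail.
z_cut = norm.ppf(1.0 - 0.5 * reject_prob)
print(z_cut)                               # ~2.326 standard deviations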
reject_prob := 1/(2·Ns)    We reject a data point if the probability of
                           being that distance or a greater distance from
                           the sample mean is less than the selected
                           rejection cutoff probability.

range := -qnorm(0.5·reject_prob, µ, σ)    For a Gaussian distribution, 'range'
                                          is the deviation from the mean (in
range = 2.326                             units of the standard deviation)
                                          beyond which a fraction reject_prob
                                          of the data points is expected to
                                          fall.

We will reject data points such that |y_j<k> - avg_k| > range·s_k.

Display the value for the right side of the inequality. How large is this vector and what does it
contain?

______________________________________________________________________

Aside:

One may be tempted to be more conservative in rejecting data by
removing only data that fall even further from the mean. However,
one must recognize that it is mathematically impossible for a data
point to be further than about sqrt(Ns)·s_k from the sample mean. Thus,
if we make the interval too large, we will never remove a data point.
The basic problem is that the value of s_k will be dominated by any
outliers in the data set. If we have an independent estimate of
the expected standard deviation of the data, it is far better to
use that in the test, especially if Ns is not large. This can
often be done, since we typically make many sets of measurements
with the same apparatus as we vary some parameter(s). Unless we
have grounds to think otherwise, we should assume that the
statistical character of the fluctuations does not change from one
data set to the next. Thus, we can combine the individual s_k values
to give a better estimate of the σ of the distribution.
Another useful technique, when we expect outliers, is to
estimate σ from the mean absolute deviation of the data from the
sample mean. For a Gaussian distribution, σ is 1.253 times the
expectation value of the mean absolute deviation. The advantage is
that the mean absolute deviation is much less affected by outliers in
the data set than the root mean square deviation is. In particular, a
single data point can have an absolute deviation no larger than about
Ns/2 times the mean absolute deviation of the sample.
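The 1.253 factor (sqrt(π/2) for a Gaussian) and the robustness claim are
easy to verify numerically; a small sketch, where the outlier value of
100 and the sample sizes are arbitrary choices of mine:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200000)
mad = np.mean(np.abs(x - x.mean()))           # mean absolute deviation
print(1.0 / mad)                              # ~1.253 = sqrt(pi/2): sigma/MAD

# Robustness: in a small sample, one gross outlier inflates the RMS-based
# estimate of sigma far more than the MAD-based one.
x = np.append(rng.normal(0.0, 1.0, 24), 100.0)
print(x.std(ddof=1))                          # inflated to ~20 by the outlier
print(1.253 * np.mean(np.abs(x - x.mean())))  # ~10: much less sensitive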

Exercise: Consider a set of Ns - 1 points with value 0, and
one point with value 1.
1. What are the mean and standard deviation of this distribution?
2. How many 'sigma' away from the mean is the point at 1?
3. Compare the values for σ estimated from the standard deviation
and from the mean absolute deviation of the sample.
4. If the point at 1 is 'erroneous', which method gives a better
estimate of σ?

End of Aside
___________________________________________________________________________

We will make use of Mathcad's Boolean expressions, which are of the
form (x > y). Such an expression evaluates to 1 if the condition inside
the parentheses holds, and 0 if not. Thus, (|y_j<k> - avg_k| > range·s_k)
equals 1 if y_j<k> is to be rejected and 0 if not.

N_reject_k := Σ_j (|y_j<k> - avg_k| > range·s_k)    Number of data points to
                                                    be rejected in the k'th
                                                    sample

Ns - N_reject_k = number of data points left in the k'th sample

Display N_reject. How many data points are rejected from each set?

mean(N_reject) = 0.363    Average number of data points rejected per
                          sample; this is less than the expected 0.5
                          because, in samples with a point far from the
                          mean, the sample standard deviation s_k
                          overestimates σ.

avg2 is a vector of the means of the data points left after
filtering the data by Chauvenet's criterion:

avg2_k := [ Σ_j (|y_j<k> - avg_k| < range·s_k)·y_j<k> ] / (Ns - N_reject_k)

The Boolean factor restricts the sum to only those points not
rejected.
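In NumPy, the Boolean-weighted sum above is a masked mean over each
column; a rough self-contained sketch, with the 2.326 cutoff hard-coded
from the 'range' value computed earlier:

import numpy as np

rng = np.random.default_rng(0)
Ns, Nt = 25, 1000
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

avg = y.mean(axis=0)
s = y.std(axis=0, ddof=1)
z_cut = 2.326                          # the 'range' value computed above

keep = np.abs(y - avg) < z_cut * s     # complement of the rejection test
n_reject = Ns - keep.sum(axis=0)
avg2 = (y * keep).sum(axis=0) / keep.sum(axis=0)

print(n_reject.mean())                 # ~0.36 rejected points per sample
print(avg2.std())                      # slightly larger than avg.std()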

Let us now compare how our 'estimates' of the true sample mean
compare with the Prior value of zero.

mean(avg2) = 3.489·10^-3    The mean of our sample averages after
                            filtering

mean(avg) = 1.282·10^-3     The mean of our sample averages without
                            filtering

stdev(avg2) = 0.216         Standard deviation of the sample averages
                            with filtering

stdev(avg) = 0.208          Standard deviation of the sample averages
                            without filtering

The value of the standard deviation for avg2 is larger than for
avg. Based upon this result, do you suggest using Chauvenet's
criterion to reject data?

Let us make a plot to compare the distributions of the mean before
and after filtering the data.

i := 0 .. 100    x_i := -1 + 0.02·i    i := 0 .. 99    Setting up bins

hc := hist(x, avg2)/(Nt·0.02)      Mean distribution of filtered data

hmean := hist(x, avg)/(Nt·0.02)    Mean distribution of initial data

[Figure: "Distributions of Sample Mean" - probability density versus
(error of mean)/(sample S.D.), from -1 to 1, comparing the mean of the
filtered data with the mean of the initial data.]

As we can see from the graph, by rejecting some of the data, we
have slightly increased the uncertainty in the mean estimated from
a sample. If we know that our noise is described by a Gaussian or
normal distribution, as is often assumed, then we get the best
estimate of the true sample mean if we never reject data.
However, if we do apply the test, the increase in the uncertainty
in the mean is small.

So why is it useful to sometimes reject some of the data? The
reason is that in the real world, the distribution of errors is
never exactly described by a Gaussian. Often, the real
distribution follows a Gaussian closely near the center, but the
probability of getting data points many standard deviations from
the mean is much larger than predicted by a Gaussian distribution.
There are many reasons for this. They come down to the ever-present
possibility of a rare but substantial disruption of the
experimental apparatus. These disturbances often fall in the category
of "one over f noise", so named because the noise spectrum they
produce is often found to vary approximately as 1/f. Such noise is
known to be ubiquitous in physical systems. Even rare disturbances
can introduce sizable uncertainty in the mean, since they can pull
the mean far from the correct value.

In order to model this effect, we will multiply a random 1% of the
data points by a factor of 100. This corresponds to a population
distribution that is the sum of two Gaussian functions: a narrow
one with 99% of the population, and a one hundred times wider
Gaussian with only 1% of the population. However, when we compute
the sample variance, the broad Gaussian dominates, leading to a
standard deviation approximately ten times larger than that of the
narrow Gaussian alone. It is in such cases that methods to test
data points and reject "outliers" will lead to a substantial
improvement in the sharpness of our predictions.

λ := 0.01    Fraction of data from the "outlier" distribution

γ := 100     Relative standard deviation of the outlier distribution

y_j,k := y_j,k·(1 + (γ - 1)·(rnd(1) < λ))    Multiply each data point by 100 a
                                             random 1% of the time, and by 1
                                             the other 99% of the time
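A sketch of the same contamination step in NumPy, using np.where to
scale a random ~1% of points by γ (self-contained, with my own seed):

import numpy as np

rng = np.random.default_rng(0)
Ns, Nt, lam, gamma = 25, 1000, 0.01, 100.0
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

# Scale each point by gamma with probability lam, by 1 otherwise.
outlier = rng.random(size=y.shape) < lam
y = np.where(outlier, gamma * y, y)

print(y.std())     # roughly 10x larger than the clean sigma of 1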

mean(y) = 0.086     This does not strongly affect the overall
                    mean, which is still close to the Prior
                    value of zero.

stdev(y) = 9.967    The standard deviation of the total data
                    set is now an order of magnitude larger
                    than before.

avg_k := mean(y<k>)                     Recompute the set of sample means
                                        and standard deviations
s_k := sqrt(Ns/(Ns - 1))·stdev(y<k>)

rms(s) = 9.965      Again, the RMS value of the standard deviations
                    nearly matches that of the total sample.

stdev(avg) = 1.999  The standard deviation of the mean is also
                    an order of magnitude larger than before.

t-Test: We can test how often the estimated 95% confidence interval
around the calculated mean, avg, contains the true sample mean (0).
This test is based upon an assumption of simple Gaussian error, which
does not rigorously apply in this case.

t_range := qt(0.95, Ns - 1)/sqrt(Ns - 1)    For samples following a Gaussian
                                            distribution, the absolute deviation
                                            of the sample mean from the true mean
t_range = 0.349                             of the distribution will be less than
                                            t_range·s_k (sample standard
                                            deviation) 95% of the time.

(1/Nt)·Σ_k (|avg_k| > t_range·s_k) = 0.078    Fraction of means found outside
                                              the predicted confidence interval

For this level of outliers, the t-test still works well; in
fact, it overestimates the size of the 95% confidence
interval.
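The coverage check can be reproduced with SciPy's Student-t quantile
(stats.t.ppf, the analogue of Mathcad's qt); a sketch that mirrors the
worksheet's formula, on freshly generated contaminated data (your
fraction will differ from 0.078, since the draws differ):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
Ns, Nt, lam, gamma = 25, 1000, 0.01, 100.0
y = rng.normal(0.0, 1.0, size=(Ns, Nt))
y = np.where(rng.random(size=y.shape) < lam, gamma * y, y)

avg = y.mean(axis=0)
s = y.std(axis=0, ddof=1)

# Worksheet's interval half-width: qt(0.95, Ns-1)/sqrt(Ns-1) times the sample SD.
t_range = stats.t.ppf(0.95, Ns - 1) / np.sqrt(Ns - 1)
print((np.abs(avg) > t_range * s).mean())   # fraction missing the true mean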

We will now use Chauvenet's criterion to "reject" data from
these samples.

N_reject_k := Σ_j (|y_j<k> - avg_k| > range·s_k)

mean(N_reject) = 0.495    Note that we are more likely to reject a
                          data point than before.

avg2_k := [ Σ_j (|y_j<k> - avg_k| < range·s_k)·y_j<k> ] / (Ns - N_reject_k)

    Average of the remaining points.

s2_k := sqrt( [ Σ_j (|y_j<k> - avg_k| < range·s_k)·(y_j<k> - avg_k)^2 ]
              / (Ns - N_reject_k - 1) )

    Standard deviation of the remaining points.

mean(avg) = 0.086     mean(avg2) = 1.614·10^-4

stdev(avg) = 1.999    stdev(avg2) = 0.321

When we use Chauvenet's criterion to reject outliers, the
standard deviation of our estimates of the true mean of the
distribution (which is zero) is close to what we obtained
above from the narrow Gaussian distribution alone, and almost
one order of magnitude smaller than what is obtained if we
retain all the data points.

If you believed that the assumptions used in constructing this
distribution function provided a good representation of your
experiments, would you recommend using Chauvenet's criterion to
filter your laboratory data? Explain.

rms(s2) = 2.584    The RMS value of the SD also falls by a large
                   factor.

To get a better understanding of how the use of Chauvenet's criterion
reduces the uncertainty in our estimate of the sample mean, let us
plot the two distributions of sample means.

i := 0 .. 100    x_i := -3 + 0.06·i    Vector to define bins for the
                                       histograms

hc := hist(x, avg2)/(Nt·0.06)      Histogram of sample mean values, using
                                   Chauvenet's criterion

hmean := hist(x, avg)/(Nt·0.06)    Histogram of sample mean values, using
                                   all the data points

i := 0 .. 99

[Figure: "Distributions of Sample Mean" - probability density versus
(error of mean)/(sample S.D.), from -3 to 3, comparing the mean of the
filtered data with the mean of the initial data.]

The two distributions look very similar near the center, but the
"filtered" distribution is higher because it does not have the
wide "wings" of highly divergent values. This last result should
not be surprising when we consider the fact that there is a probability

(1 - λ)^Ns = 0.778

that a sample does not contain even one point from the wider
distribution. To make the wings of the distribution clear, we will
blow up the vertical scale.

[Figure: "Distributions of Sample Mean" - the same comparison with the
vertical scale expanded to 0-0.2, making the broad wings of the
unfiltered distribution visible.]

Let us summarize what we have observed so far:

1. For a truly Gaussian distribution of data, the use of Chauvenet's
criterion for data rejection produces a worse estimate of the true
sample mean. However, we only get a very modest increase in
uncertainty. Qualitatively, we would obtain the same result if we had
used any of the other available methods for rejection of outliers in
the data.

2. We then considered a distribution of data that consists of
a narrow Gaussian distribution, but with a small percentage from a much
wider distribution, corresponding to 'outliers' in the data. We found
that even though the probability of getting one or more outliers in a
set of data was small (~22%), the distribution of sample means now
had a much wider tail, with a substantial probability of an error
much larger than that expected based upon the narrow distribution
alone.

3. By 'rejecting outliers' in the data by Chauvenet's criterion, we
dramatically reduce the probability of a large error in our estimate
of the true mean. The method is not 'perfect', in that we will not
reject all data from the wider distribution function, and we will
'reject' some data points that are part of the narrow Gaussian
distribution. However, we obtain a much better statistical estimate
than by averaging all the data points, no matter how far they lie from
the mean.

A natural question to consider is: should we now apply Chauvenet's
criterion again, to the pruned data set? Since the RMS sample standard
deviation has been significantly reduced, we may now find further
points that should be eliminated. We will now try this and see if we
obtain a further reduction in the range of calculated mean values.
Strictly, we should recalculate the value of range using Ns - N_reject_k
instead of Ns, but that would require a different value for each sample
k. Since Ns >> N_reject_k, we expect this approximation to introduce
only a small error.

N_reject2_k := Σ_j (|y_j<k> - avg2_k| > range·s2_k)

mean(N_reject2) = 0.579    Note that we are only modestly more likely
                           to reject a data point than before.

mean(N_reject) = 0.495

Sample mean and standard deviation after a second pass of data
rejection:

avg3_k := [ Σ_j (|y_j<k> - avg2_k| < range·s2_k)·y_j<k> ] / (Ns - N_reject2_k)

s3_k := sqrt( [ Σ_j (|y_j<k> - avg2_k| < range·s2_k)·(y_j<k> - avg2_k)^2 ]
              / (Ns - N_reject2_k - 1) )

mean(avg3) = 2.758·10^-3    stdev(avg3) = 0.218    Two passes of Chauvenet's criterion

mean(avg2) = 1.614·10^-4    stdev(avg2) = 0.321    One pass of Chauvenet's criterion

mean(avg) = 0.086           stdev(avg) = 1.999     Unfiltered data

rms(s3) = 0.997    rms(s2) = 2.584    rms(s) = 9.965

Compare the mean, stdev, and rms values for each filtering pass. Where is the greatest
improvement? How significant is the improvement?
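The repeated-pass idea generalizes to a loop; a rough sketch that
applies the criterion until no further points are rejected (the
function name, the cutoff of 2.326, and the pass limit are my choices;
the worksheet itself stops at two passes):

import numpy as np

def chauvenet_mean(x, z_cut=2.326, max_passes=10):
    """Repeatedly reject points more than z_cut sample SDs from the mean."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_passes):
        m, s = x.mean(), x.std(ddof=1)
        keep = np.abs(x - m) < z_cut * s
        if keep.all():
            break                       # converged: nothing left to reject
        x = x[keep]
    return x.mean()

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 25)
sample[0] *= 100.0                      # plant a gross outlier
print(chauvenet_mean(sample))           # close to 0 despite the outlier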

In comparing methods, we concentrate on the standard deviation of the
set of sample means, rather than the mean of sample mean values. Why
is this?

Let us plot the mean value distributions for one and two passes:

hc3 := hist(x, avg3)/(Nt·0.06)

[Figure: "Distributions of Sample Mean Values" - probability density
versus (error of mean)/(sample S.D.), from -3 to 3, comparing one pass
and two passes of Chauvenet's criterion.]

The two distributions look almost identical. Let’s plot again with
an expanded vertical axis.

[Figure: The same comparison with the vertical axis expanded to 0-0.2,
showing that the second pass removes most of the remaining far-out mean
values.]

We see that the primary effect of the second pass is to eliminate the
few remaining mean values that are more than one sample SD from the
true mean.

The above numerical experiments demonstrate that we can make sharper
predictions from a limited set of experimental data if we adopt a
protocol to reject data points that appear to be far from the rest.
The optimal strategy depends upon the population distribution. For a
perfect Gaussian distribution, any data rejection reduces the sharpness
of the estimate of the mean. However, the degradation is modest. For
a population that includes a small fraction with a much broader spread
than the bulk of the distribution, data rejection can pay large
dividends. I encourage you to adjust the values of the different
parameters used above, such as λ and γ, to see how the calculated
results change. Numerical 'experiments' of the type used in this
worksheet are a powerful way to estimate the expected precision of
proposed methods of data analysis, as a function of the statistical
properties of the distribution to be sampled.

The rejection of data is often viewed negatively, as 'fudging' or even
dishonest. This is certainly not true if the following 'rules' are
observed:

1. The decision to reject data is made by an objective statistical
test, not by subjective judgment.

2. The algorithm used to reject data should be clearly stated when the
data are reported, and all the data, including the 'rejected' points,
should be stored together.

3. The choice of algorithm should not change after the data have been
taken and subjected to a first analysis. Since it helps to have an
idea of what the data population distribution is when making such a
decision, it is desirable to take a preliminary data set and try
different algorithms on it as part of the selection process. However,
this preliminary data should not be included in the overall statistical
analysis.

When data are worked up by hand, a strongly deviant point will be
obvious to all but the most oblivious person. Often, these data points
are simply 'omitted', or perhaps the entire data set is retaken. The
latter is often a needless waste of effort, but certainly better than
averaging all the data, outliers and all. This is especially true if
confidence intervals based upon assumed Gaussian statistics are
invoked.

Why is this? Think about the distribution of sample means shown in the
plots above, compared to those obtained earlier using a true Gaussian
distribution of errors.

Q-Test:

Chauvenet's criterion is an efficient test for outliers when either we
are dealing with a sample of reasonable size, or we have a prior
estimate of the expected sample standard deviation. For small samples
(say N < 10), the problem is that if we estimate the standard deviation
from the sample itself, it is likely to be dominated by the possible
outlier. This is why, above, it was cautioned not to make the
probability of rejection of a data point much below 1/(2N), or one will
never eliminate any points, no matter how deviant.

For small samples, a more efficient test for outliers is known
as the Q-test. One arranges the data points in ascending order.
Suppose the lowest point (i.e., the first one) looks 'far from the
pack'. We define the ratio Q = (x2 - x1)/(xN - x1), which is the
fraction of the range of the data used up by the first interval. We
define PQ(Q0, N) as the probability that, for a random sample of N
points selected from a Gaussian distribution, Q will be greater than
Q0. Since we are dealing with a ratio of two differences, we can
compute this probability for a standard Gaussian distribution and it
will apply to any Gaussian, regardless of the mean or standard
deviation. I leave it to the reader to justify the following
expression for this probability, remembering that if the first point
is at x1 and the last at xN, then the points x2...xN-1 must lie in the
interval [x1 + Q0·(xN - x1), xN]. I split the expression into two
parts so that it 'fits' on one line.

f_temp(x1, xN, Q0, N) := exp(-0.5·xN^2)·[cnorm(xN) - cnorm(x1 + Q0·(xN - x1))]^(N-2)

PQ(Q0, N) := [N·(N - 1)/(2·π)] · ∫ from -10 to 10 of exp(-0.5·x1^2) ·
             [ ∫ from x1 to 10 of f_temp(x1, xN, Q0, N) dxN ] dx1
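This double integral can be checked with scipy.integrate.dblquad and
the standard normal CDF; a sketch of one possible translation (cnorm
corresponds to SciPy's norm.cdf):

import numpy as np
from scipy.integrate import dblquad
from scipy.stats import norm

def PQ(Q0, N):
    """Probability that the first gap exceeds Q0 times the range, N Gaussian points."""
    def f(xN, x1):      # dblquad integrates its first argument (xN) innermost
        return (np.exp(-0.5 * (x1**2 + xN**2))
                * (norm.cdf(xN) - norm.cdf(x1 + Q0 * (xN - x1)))**(N - 2))
    inner, _ = dblquad(f, -10, 10, lambda x1: x1, lambda x1: 10)
    return N * (N - 1) / (2 * np.pi) * inner

print(PQ(0.5, 5))       # ~0.149, matching the worksheet's value below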

For example, the probability that, with 5 points, the first two will be
separated by half of the total range is calculated by:

PQ(0.5, 5) = 0.149

Using the probability distribution function for a standard Gaussian,

P(x) := (1/sqrt(2·π))·exp(-x^2/2),

derive the above expression. Note that since the points are
independent, the joint probability distribution function is
Pj(x1, x2, ..., xN) = P(x1)·P(x2)·...·P(xN). Also note that we can
always relabel the points to make x1 the smallest and xN the largest
point, but we must then multiply by N·(N - 1), which is the number of
possible positions for the lowest and highest points in the list.

Points from a general Gaussian distribution, with mean m and standard
deviation s, can be written as zi = m + s·xi, where the xi are points
from a standard Gaussian distribution. Show that the value of Q
calculated from [z1, z2, ..., zN] is the same as Q calculated from
[x1, x2, ..., xN].

The probability that the highest two points are separated by the same
fraction is identical, given the symmetry of the normal distribution.
For testing data, we are more interested in the inverse of the above
function, i.e., for what value of Q is PQ less than some given
fraction p? We use Mathcad's root finder to get this value; we need to
pass it an initial guess for Q0.

Q_test(p, N, Q) := root(PQ(Q, N) - p, Q)

If we are rejecting data with a given confidence, say 90%, we want
to know the probability that either the lowest or the highest point
fails the Q-test. This is given by the sum of the probability that the
lowest point does plus the probability that the highest point does.
This 'double counts' cases where both the first and last intervals fail
the test simultaneously, but that case is of low probability in cases
of interest. Thus, if we want 90% confidence for rejecting a point
from a sample of N = 10, the point must have a Q value greater
than:

Q_test(0.5·(1 - 0.90), 10, 0.5) = 0.411
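In place of Mathcad's root function, SciPy's bracketing root finder
brentq can invert PQ; a sketch that reuses the PQ function defined in
the previous sketch (the bracket end points are my choices):

from scipy.optimize import brentq

# Q_test(p, N): the Q0 at which PQ(Q0, N) = p. brentq needs a bracketing
# interval rather than a single initial guess, which sidesteps the
# convergence issues discussed below; PQ falls from ~1 at Q0 = 0 to ~0
# at Q0 = 1, so [1e-6, 1 - 1e-6] brackets the root.
def Q_test(p, N):
    return brentq(lambda Q0: PQ(Q0, N) - p, 1e-6, 1.0 - 1e-6)

print(Q_test(0.5 * (1 - 0.90), 10))    # ~0.411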

We used 0.5 as an initial guess of Q0 for the root search. You must
change this guess if the root search does not converge, or if it
returns an unreasonable value, say a Q0 outside the interval [0, 1].
One can just try different values in the PQ function until a
probability close to the desired value is found, and use that as an
initial guess. In this case, the probability of both the first and
last points failing the Q-test is only 2·10^-5, and thus is negligible,
as predicted. I leave it as an exercise to the reader to modify the
f_temp function given above to calculate this probability. Using the
Q_test function, one can get Q0 values for any confidence level and any
number of points. I suggest that the reader perform a 'test' on a
sample data set generated by Mathcad with known statistical properties,
as we did above, and compare the number of points rejected by the
Q-test with the number expected.

The Q-test is efficient as long as we have no more than one outlier in
the data set. As a result, it is best used for relatively small data
sets, or where the probability of an outlier is considered small enough
that the chance of two outliers in one particular data set can be
neglected. One will only be justified in removing multiple data points
if the sample size is reasonably large, at which point one should use
Chauvenet's criterion.

Modify the f_temp function above to calculate the probability that both
the first and last intervals are simultaneously larger than Q times
the total interval, |x1 - xN|, and then calculate this probability for
Q = 0.411 and N = 10. Hint: in what interval (in terms of x1 and xN)
must the points x2...xN-1 now lie? The end points of this interval are
the values at which cnorm must be evaluated.

For a 90% confidence level, what is the critical value of Q to use
on data sets with 25 points? How about for 98% confidence?

Qt := Q_test(0.5·(1 - 0.90), Ns, 0.3)    Qt = 0.277

At the bottom of this worksheet, write the Mathcad code to apply the
Q-test to the simulated data sets y<k> to reject 'outliers'. For this,
you will need to use Mathcad's sort function to put each column of y in
ascending order:

y<k> := sort(y<k>)

Compare the standard deviation of the sample means of data filtered
using the Q-test to that of the same data filtered using Chauvenet's
criterion. Which is better for this data? Try modifying both the
number of data points in each sample and the probability of an outlier,
and determine empirically under what conditions one test or the other
produces a sharper distribution of the sample means.

avg4_k := [ Σ (i = 1 .. Ns-2) y_i,k
            + ( (y_1,k - y_0,k)/(y_Ns-1,k - y_0,k) < Qt )·y_0,k
            + ( (y_Ns-1,k - y_Ns-2,k)/(y_Ns-1,k - y_0,k) < Qt )·y_Ns-1,k ]
          / [ Ns - 2 + ( (y_1,k - y_0,k)/(y_Ns-1,k - y_0,k) < Qt )
                     + ( (y_Ns-1,k - y_Ns-2,k)/(y_Ns-1,k - y_0,k) < Qt ) ]

The Boolean factors keep an end point only when its gap fraction passes
the Q-test; the interior points are always kept.

stdev(avg) = 1.999     No filtering

stdev(avg2) = 0.321    Chauvenet's criterion

stdev(avg4) = 0.571    Q-test
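A rough NumPy version of this Q-test filter, applied column by column
after sorting, with Qt = 0.277 from above (seed and contamination
parameters are again my choices):

import numpy as np

rng = np.random.default_rng(0)
Ns, Nt, Qt = 25, 1000, 0.277
y = rng.normal(0.0, 1.0, size=(Ns, Nt))
y = np.where(rng.random(size=y.shape) < 0.01, 100.0 * y, y)

y = np.sort(y, axis=0)                     # ascending order within each sample
span = y[-1] - y[0]                        # total range of each sample
keep_lo = (y[1] - y[0]) / span < Qt        # keep lowest point if its gap is small
keep_hi = (y[-1] - y[-2]) / span < Qt      # likewise for the highest point

total = y[1:-1].sum(axis=0) + keep_lo * y[0] + keep_hi * y[-1]
count = (Ns - 2) + keep_lo + keep_hi
avg4 = total / count

print(avg4.std())                          # compare with the Chauvenet result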

It is not obvious what the 'best' confidence level is to use in
the Q-test. In fact, one widely used procedure is to automatically
throw out the highest and lowest points and average the rest,
corresponding to Qt = 0. Try this and compare with the other
results. Of these two methods, which one do you expect will be
better for smaller sample sizes, and which for larger?

avg5_k := [ Σ (i = 1 .. Ns-2) y_i,k ] / (Ns - 2)    stdev(avg5) = 0.385
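In the NumPy sketch above, this Qt = 0 variant is just a symmetric
trimmed mean of the sorted columns:

# Continuing the Q-test sketch above (y already sorted column-wise):
# drop the lowest and highest point of each sample and average the rest.
avg5 = y[1:-1].mean(axis=0)
print(avg5.std())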

Go to the top of the worksheet and change Ns to 10. How do the
different methods of data rejection now compare? Also try Ns equal
to 5.

Often, we do not have a number of identical measurements, but rather
a series of measurements as a function of some variable that changes.
For example, in a kinetic study, we would measure one or more
concentrations versus time. We then fit the data to a model based
upon an assumed rate law. Explain how we could modify the Chauvenet
and Q tests to apply to this situation.

