by
Kevin Lehmann
Department of Chemistry
Princeton University
Lehmann@princeton.edu
© Copyright Kevin Lehmann, 1997. All rights reserved. You are welcome to use this document in your own
classes but commercial use is not allowed without the permission of the author. The author welcomes any
constructive criticisms or other comments from either educators or students.
Goal:
To demonstrate to students how a statistically objective method
for removing or rejecting data points can improve the accuracy of
estimates made from experimental data. Students will learn how to
apply two commonly used techniques and, more generally, how to use
numerical simulation of data to evaluate different proposed methods of
data analysis.
Prerequisites:
This worksheet assumes that the student is already
familiar with the basic concepts of probability theory and error
analysis, including the concepts of a population and a sample drawn
from that population; the mean and standard deviation of both the
population and the sample; the Gaussian distribution function; and
determination of confidence intervals based upon that distribution
function. These topics are covered in most texts that discuss
error analysis, as well as in several worksheets that I have
written and will be included in this archive (Mean_vs_Median.mcd,
Gaussian_Distribution.mcd, and Linear_Regression.mcd). The
worksheet makes extensive use of Mathcad's statistical
functions, and it would be helpful for the student to have on hand
a copy of the manual to review what each function call does.
We will generate 1,000 such data sets, picked from a known "Prior
Distribution" having a zero mean and unit standard deviation, and compare
the results of our statistical analysis of the generated data to the
statistics of the Prior Distribution. In a real experiment, we never
have precise values of the true sample mean and standard deviation; if we
did, there would be no reason to do the experiment! However, the type of
"numerical simulation" we are going to perform is a useful way to "test"
the procedures we plan to use to analyze our real data. Such simulations are
also very useful in giving experience about what errors to expect given our
assumptions about the nature of measurement errors. From each "set" of
measurements, we compute an estimate of the "true" value of the quantity
we are trying to measure. However, because of measurement error or
noise, we will not get the precise "prior" value we assumed in generating
the "synthetic data". Further, each set of data we generate will give a
different estimate. If our statistical method is "unbiased", the mean
of these estimates will approach our assumed prior value. The
computed standard deviation of the distribution of estimated values is
directly proportional to our uncertainty in the true value (assuming we
did not know it a priori), based upon a single data set. So now we will
start by considering data taken from a known Gaussian distribution.
Nt := 1000                  Number of different samples or sets of data points
Ns := 25                    Number of data points in each sample
µ := 0       σ := 1
y<k> := rnorm(Ns, µ, σ)     Mathcad function that generates Ns Gaussian
                            random numbers, with mean µ and standard
                            deviation σ. These values are put into a
                            two-dimensional matrix. Each data sample
                            of 25 points is one column of the matrix.
In the empty space to the right ask Mathcad to show the y
matrix by typing y= . How many columns are in this array?
How many rows?
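The y matrix above can also be mirrored outside Mathcad. Here is a minimal sketch in Python with NumPy (the names Nt, Ns, mu, and sigma follow the worksheet; the seed value is an arbitrary choice for reproducibility):

```python
import numpy as np

# Worksheet parameters: Nt samples, each a column of Ns points drawn
# from the Prior Distribution with zero mean and unit standard deviation.
Nt, Ns = 1000, 25
mu, sigma = 0.0, 1.0

rng = np.random.default_rng(seed=0)        # fixed seed, for reproducibility
y = rng.normal(mu, sigma, size=(Ns, Nt))   # analogue of rnorm: each column is one sample

print(y.shape)   # (25, 1000): Ns rows, Nt columns
```

Printing y.shape answers the questions above: the array has Nt columns, one per sample, and Ns rows.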
mean(y) = 1.282·10⁻³      Here we check the statistics of the
stdev(y) = 1.00108        total set, y, of random numbers, using
                          Mathcad functions.
Why are these values for the mean and standard deviation of the set of numbers we generated
not exactly equal to the values of µ and σ that we used to generate y?
Now let's generate statistics for each one of our sets of Ns data
points, as we would do if they were real measurements from the lab.
avg_k := mean(y<k>)       The calculated mean of the k'th sample of 25
                          points
From each set of data points, we have computed a sample mean value,
avg_k, and a sample standard deviation, s_k. Next we compute the
statistics (mean and standard deviation of the 1000 averages) of the
set of avg_k we have determined.
mean(avg) = 1.282·10⁻³    This is exactly the same as the mean
                          of y calculated above
stdev(avg) = 0.208        The computed standard deviation for the generated
                          data is smaller than that of the Prior Distribution
                          by close to the expected factor of 1/√Ns = 0.2
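The same check can be sketched in Python (an illustration, not the worksheet's Mathcad; the point being verified is the 1/√Ns shrinkage of the spread of the sample means):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
Nt, Ns = 1000, 25
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

avg = y.mean(axis=0)   # avg_k: the sample mean of each of the Nt columns

# The spread of the sample means should be smaller than the population
# sigma = 1 by close to the factor 1/sqrt(Ns) = 0.2.
print(avg.std())
```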
i := 0 .. 100    Set the range variable to define the bins for the histogram plot
[Plot: Probability Density vs. (Value − Mean)/Standard Deviation; curves: Data, Mean]
Display the value for the right side of the inequality. How large is this vector and what does it
contain?
Aside:
One may be tempted to be more judicious in rejecting data by
removing only data that falls even further from the mean. However,
one must recognize that it is simply not possible, mathematically,
for a data point to be further than √Ns·s_k from the mean. Thus, if
we make the interval too large, we will never remove a data point.
The basic problem is that the value of s_k will be dominated by any
outliers in the data set. If we have an independent estimate of
the expected standard deviation of the data, it is far better to
use that in the test, especially if Ns is not large. This can
often be done, since we typically make many sets of measurements
with the same apparatus as we vary some parameter(s). Unless we
have grounds to think otherwise, we should assume that the
statistical character of the fluctuations does not change from one
data set to the next. Thus, we can combine the individual s_k values
to give a better estimate of the σ of the distribution.
Another useful technique when we expect outliers is to
estimate σ from the mean absolute deviation of the data from the
sample mean. For a Gaussian distribution, σ is 1.253 times the
expectation value of the mean absolute deviation. The advantage is
that the mean absolute deviation is much less affected by outliers
in the data set than the root mean squared deviation. In particular,
it is possible for a data point to have an absolute deviation as
large as Ns/2 times the mean absolute deviation.
End of Aside
___________________________________________________________________________
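The aside's mean-absolute-deviation estimate of σ can be sketched as follows (a Python illustration; the factor 1.253 ≈ √(π/2) converts the Gaussian mean absolute deviation into σ, and the single outlier value of 50 is purely illustrative):

```python
import numpy as np

def sigma_from_mad(sample):
    """Estimate sigma as sqrt(pi/2) ~ 1.253 times the mean absolute
    deviation; for a Gaussian, E|x - mu| = sigma * sqrt(2/pi)."""
    sample = np.asarray(sample, dtype=float)
    return np.sqrt(np.pi / 2) * np.mean(np.abs(sample - sample.mean()))

rng = np.random.default_rng(seed=2)
clean = rng.normal(0.0, 1.0, size=100_000)
print(sigma_from_mad(clean))   # close to the true sigma of 1

# One gross outlier (illustrative value) inflates the rms-based estimate
# far more than the mean-absolute-deviation estimate.
dirty = np.append(rng.normal(0.0, 1.0, size=24), 50.0)
print(sigma_from_mad(dirty), dirty.std(ddof=1))
```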
N_reject_k := Σ_j ( |y_j<k> − avg_k| > range·s_k )    Number of data points to
                                                      be rejected in the k'th
                                                      sample
Display N_reject. How many data points are rejected from each set?
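In Python, the same rejection count might look like this (a sketch; the worksheet's `range` multiplier is taken here from Chauvenet's criterion for Ns = 25, i.e. the |z| whose two-sided Gaussian tail probability is 1/(2·Ns), about 2.33):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=3)
Nt, Ns = 1000, 25
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

avg = y.mean(axis=0)          # avg_k
s = y.std(axis=0, ddof=1)     # s_k, the sample standard deviation

# Chauvenet's criterion: reject a point when its two-sided Gaussian tail
# probability is below 1/(2*Ns); for Ns = 25 this gives range_ ~ 2.33.
range_ = norm.ppf(1 - 1 / (4 * Ns))

N_reject = (np.abs(y - avg) > range_ * s).sum(axis=0)   # count per sample
print(N_reject.mean())        # average number of rejected points per sample
```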
Let us now compare how our 'estimates' of the true sample mean
compare with the Prior value of zero.
hc := hist(x, avg2)·(0.02·Nt)⁻¹       Mean distribution of filtered data
hmean := hist(x, avg)·(0.02·Nt)⁻¹     Mean distribution of initial data
[Plot: Error of Mean / Sample S.D.; curves: Mean of Filtered Data, Mean of Initial Data]
avg_k := mean(y<k>)                   Recompute the set of sample means and
s_k := √(Ns/(Ns − 1))·stdev(y<k>)     standard deviations
t_range := qt(0.95, Ns − 1)/√(Ns − 1)    For samples following a Gaussian
t_range = 0.349                          distribution, the absolute deviation of the
                                         sample mean from the true mean of the
                                         distribution will be less than t_range × s_k
                                         (sample standard deviation) 95% of the time.
(1/Nt)·Σ_k ( |avg_k| > t_range·s_k ) = 0.078    Fraction of means found outside the
                                                predicted confidence interval
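This confidence-interval check translates directly into Python (a sketch; Mathcad's qt(p, d) corresponds to scipy.stats.t.ppf(p, d)):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(seed=4)
Nt, Ns = 1000, 25
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

avg = y.mean(axis=0)
s = np.sqrt(Ns / (Ns - 1)) * y.std(axis=0, ddof=0)   # the worksheet's s_k

# Mathcad's qt(0.95, Ns-1) is scipy's t.ppf(0.95, Ns-1)
t_range = t.ppf(0.95, Ns - 1) / np.sqrt(Ns - 1)
print(t_range)        # close to 0.349, as in the worksheet

frac_outside = np.mean(np.abs(avg) > t_range * s)
print(frac_outside)   # fraction of means outside the predicted interval
```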
N_reject_k := Σ_j ( |y_j<k> − avg_k| > range·s_k )
mean(avg) = 0.086      mean(avg2) = 1.614·10⁻⁴
hc := hist(x, avg2)·(0.06·Nt)⁻¹       Histogram of sample mean values, using
                                      Chauvenet's criterion
hmean := hist(x, avg)·(0.06·Nt)⁻¹     Histogram of sample mean values, using
                                      all the data points
i := 0 .. 99
[Plot: Error of Mean / Sample S.D.; curves: Mean of Filtered Data, Mean of Initial Data]
The two distributions look very similar near the center, but the
"Filtered" data distribution is higher because it does not have the
wide "wing" of highly divergent values. This last result should
not be surprising when we consider the fact that we have a
probability = (1 − λ)^Ns = 0.778
that a sample has not even one point from the wider distribution.
To make the wings of the distribution clear, we will blow up the
vertical scale.
[Plot: Probability Density vs. Error of Mean / Sample S.D. (expanded vertical scale); curves: Mean of Filtered Data, Mean of Initial Data]
N_reject2_k := Σ_j ( |y_j<k> − avg2_k| > range·s2_k )
mean( N_reject2 ) = 0.579 Note that we are only modestly more likely
to reject a data point than before
mean(avg3) = 2.758·10⁻³    stdev(avg3) = 0.218    Two passes of Chauvenet's criterion
mean(avg2) = 1.614·10⁻⁴    stdev(avg2) = 0.321    One pass of Chauvenet's criterion
Compare the mean, stdev, and rms values for each filtering pass. Where is the greatest
improvement? How significant is the improvement?
Let us plot the mean value distributions for one and two passes:
hc3 := hist(x, avg3)·(0.06·Nt)⁻¹
[Plot: Error of Mean / Sample S.D.; curves: One Pass, Two Passes]
The two distributions look almost identical. Let’s plot again with
an expanded vertical axis.
[Plot: Error of Mean / Sample S.D. (expanded vertical axis); curves: One Pass, Two Passes]
2. The algorithm used to reject data should be clearly noted when the
data is reported, and all the data, including the 'rejected' points,
should be stored together.
3. The decision on the algorithm to be used should not change after the
data has been taken and subjected to a first analysis. Since it helps
to have an idea of what the data population distribution is when making
such a decision, it is desirable to take a preliminary data set and try
different algorithms on it as part of the selection process. However,
this data should not be included in the overall statistical analysis.
Q-Test:
f_temp(x1, xN, Q0, N) := exp(−0.5·xN²)·( cnorm(xN) − cnorm(x1 + Q0·(xN − x1)) )^(N − 2)

P_Q(Q0, N) := [N·(N − 1)/(2·π)] · ∫[x1 = −10 .. 10] exp(−0.5·x1²) · ( ∫[xN = x1 .. 10] f_temp(x1, xN, Q0, N) dxN ) dx1

P_Q(0.5, 5) = 0.149
P(x) := (1/√(2·π))·exp(−x²/2)
The probability that the highest two points are separated by the same
fraction is identical, given the symmetry of the Normal distribution.
For testing data, we are more interested in the inverse of the above
function, i.e., for what value of Q is P_Q less than some given
fraction p? Use Mathcad's root function to find this value. We need to
pass an initial guess for Q0.
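The double integral can also be checked numerically outside Mathcad. The sketch below (Python/SciPy, with cnorm replaced by the standard normal CDF) evaluates P_Q and then inverts it with a root finder, mirroring the worksheet's use of root:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import dblquad
from scipy.optimize import brentq

def P_Q(Q0, N):
    """Probability that the gap between the lowest two of N standard-normal
    points exceeds Q0 times the total spread (xN - x1)."""
    def f_temp(xN, x1):
        return (np.exp(-0.5 * x1**2) * np.exp(-0.5 * xN**2)
                * (norm.cdf(xN) - norm.cdf(x1 + Q0 * (xN - x1)))**(N - 2))
    # xN runs from x1 to 10, x1 runs from -10 to 10, as in the worksheet
    val, _ = dblquad(f_temp, -10, 10, lambda x1: x1, lambda x1: 10.0,
                     epsabs=1e-6, epsrel=1e-6)
    return N * (N - 1) / (2 * np.pi) * val

print(P_Q(0.5, 5))   # close to the worksheet's value of 0.149

# Inverse problem: the critical Q whose tail probability is p = 0.10, N = 5
Q_crit = brentq(lambda q: P_Q(q, 5) - 0.10, 0.05, 0.95, xtol=1e-3)
print(Q_crit)
```

Since P_Q decreases monotonically with Q0, a bracketing root finder such as brentq needs no initial guess beyond the bracket itself.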
Modify the f_temp function above to calculate the probability that both
the first and last intervals simultaneously are larger than Q times
the total interval, |x1 − xN|, and then calculate this probability for
Q = 0.411 and N = 10. Hint: In what interval (in terms of x1 and xN) must
the points x2 ... x(N−1) now lie? The end points of this interval are the
values at which cnorm must be calculated.
For a 90% confidence interval, what is the critical value of Q to use
on data sets with 25 points? How about for 98% confidence?
At the bottom of this worksheet, write the Mathcad code to apply the Q
test to the simulated data sets y<k> to reject 'outliers'. For this,
you will need to use Mathcad's sort function to put each column of y in
ascending order:

avg4_k := [ Σ(i = 1 .. Ns−2) y_(i,k)
            + ( (y_(1,k) − y_(0,k)) / (y_(Ns−1,k) − y_(0,k)) < Q_t ) · y_(0,k)
            + ( (y_(Ns−1,k) − y_(Ns−2,k)) / (y_(Ns−1,k) − y_(0,k)) < Q_t ) · y_(Ns−1,k) ]
          / [ Ns − 2
            + ( (y_(1,k) − y_(0,k)) / (y_(Ns−1,k) − y_(0,k)) < Q_t )
            + ( (y_(Ns−1,k) − y_(Ns−2,k)) / (y_(Ns−1,k) − y_(0,k)) < Q_t ) ]

avg5_k := (1/(Ns − 2)) · Σ(i = 1 .. Ns−2) y_(i,k)        stdev(avg5) = 0.385
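As a cross-check on the exercise (not the worksheet's intended Mathcad solution), the avg4 recipe can be sketched in Python. Each column is sorted, and an end point is dropped when its Q ratio exceeds a critical value; the value Qt = 0.29 below is only a placeholder — the proper value for Ns = 25 should come from inverting P_Q as described above.

```python
import numpy as np

def q_filtered_mean(col, Qt):
    """Mean of one sample: sort the points, then drop the lowest if
    (y1-y0)/(yN-y0) > Qt and the highest if (yN-yN1)/(yN-y0) > Qt."""
    ys = np.sort(col)
    spread = ys[-1] - ys[0]
    keep = ys[1:-1].tolist()              # middle points are always kept
    if (ys[1] - ys[0]) / spread <= Qt:
        keep.append(ys[0])                # lowest point passes the Q test
    if (ys[-1] - ys[-2]) / spread <= Qt:
        keep.append(ys[-1])               # highest point passes the Q test
    return np.mean(keep)

rng = np.random.default_rng(seed=5)
Nt, Ns = 1000, 25
y = rng.normal(0.0, 1.0, size=(Ns, Nt))

Qt = 0.29   # placeholder critical value; derive the real one from P_Q
avg4 = np.array([q_filtered_mean(y[:, k], Qt) for k in range(Nt)])
print(avg4.mean(), avg4.std())
```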
Go to the top of the worksheet and change Ns to 10. How do the
different methods of data rejection now compare? Also try Ns equal to 5.