0004-6981/92 $3.00+0.00
1991 Pergamon Press plc
Printed in Great Britain.
Abstract--Procedures to estimate missing data, determine extrema, and derive uncertainties for data
collected in ambient air monitoring networks are presented. The optimal linear estimators used obtain
unbiased, minimum variance results based on the temporal and spatial correlation of the data and estimates
of sample uncertainty. The first estimator interpolates missing data. The second estimator derives extrema,
e.g. minimum and maximum concentrations, from the completed data set. Together the estimators can be
used to check the validity of monitored observations, identify outliers, and estimate regional and local
components of pollutant levels. The estimators are evaluated using data collected in urban air quality
monitoring networks in Houston, Philadelphia and St Louis.
Key word index: Distribution, data imputation, optimal estimation, statistical models, uncertainty.
1. INTRODUCTION
2. BACKGROUND
Ambient air quality monitoring networks are established for purposes which include (1) the assessment of
STUART A. BATTERMAN
Table 1. Statistical methods applicable to network data

Single monitoring site, single pollutant:
    Trend analysis
    Analysis of distributions
    Probability of exceedance
    Extreme value statistics
    Time series (ARIMA)
    Spectral analysis
    Markov-type models

Single monitoring site, multiple variables:
    Correlation analysis
    Factor analysis
    Generalized linear models (2)
    Receptor models
    Cluster analysis
    Time series (ARIMAX)

Notes: (1) Includes contouring, e.g. linear (planar) and non-linear interpolation.
(2) Includes linear and non-linear regression.
(3) No studies identified using co-kriging.
(4) Could use procedure discussed with exogenous variables.
Often it is important to apportion pollutant contributions attributable to local and distant emission
sources. Long-range transport by distant sources can
provide a significant 'regional' or 'background' contribution which restricts the control options available
to local authorities. Such situations can occur with
PM-10, sulfate, ozone and other pollutants.
Approaches for separating local and regional components use either dispersion modeling or ambient
3. OPTIMAL ESTIMATORS
Statistical framework
A framework for ambient air concentrations is
developed considering a single conservative (nonreactive) pollutant measured in an urban scale monitoring network. The concentration observed at site i
and time t, $C_{i,t}$, consists of three components:
(1)
Measurement uncertainty
Several approaches can be used to estimate error
$V_{i,t}$. Random errors may be estimated using replicate
observations, e.g. colocated samplers, while systematic
errors can be estimated using reference or calibrated
samplers. Alternatively, errors may be estimated by
isolating uncertainties in the component measures and
then propagating their effects, e.g. using Gaussian
quadrature. Lastly, empirical means may be used. The
following examples demonstrate these approaches.
Since 1981, federal regulations have required state
and local agencies to assess the accuracy and precision
of their ambient air quality measurement systems.
Data collected in the Precision and Accuracy Reporting System (PARS) are based on blind audits using
calibration gases for continuous instruments (gases),
and colocated samples for manual instruments (TSP,
Pb, and older gas measuring instruments). PARS
results, expressed as a 95% confidence interval, typically show a relative accuracy of about 10% for most
of the criteria pollutants. The precision of the measurements, obtained by repeated measures, is about
10% for O3, 12% for TSP, 20% for SO2, and 46% for
NO2 (Rhodes and Evans, 1988). These statistics represent many thousands of audits.
One theoretical study of errors in mass, flow rate
and timing measurements suggests errors about half of
that obtained in field evaluations (Evans and Ryan,
1983). Other examples of component errors estimate
filter mass measurement errors (using beta gauge
attenuation) of 3 µg m^-3 for 12-h samples (Jaklevic et
al., 1981), and biases between gravimetric and beta
gauge measurements of < 5% (Courtney et al., 1982).
With air volume errors of 5-10%, these figures yield a
total error of 10-20% at typical particle concentrations.
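
The way component errors combine into a total can be sketched as a root-sum-of-squares calculation. This is only an illustration: it assumes the mass and air volume errors are independent, and the "typical" concentrations of 20 and 30 µg m^-3 are assumed values chosen for the example, not figures taken from the studies cited. The function name is hypothetical.

```python
import math

def combined_relative_error(mass_error_ug_m3, concentration_ug_m3, volume_rel_error):
    """Combine a mass error and a relative air volume error in quadrature.

    Assumes the two error sources are independent (an assumption of this
    sketch, not a statement from the text).
    """
    mass_rel_error = mass_error_ug_m3 / concentration_ug_m3
    return math.sqrt(mass_rel_error ** 2 + volume_rel_error ** 2)

# Assumed illustrative concentrations (20 and 30 ug/m3), 3 ug/m3 mass error,
# and air volume errors of 5% and 10%.
low = combined_relative_error(3.0, 30.0, 0.05)
high = combined_relative_error(3.0, 20.0, 0.10)
print(f"combined relative error: {low:.1%} to {high:.1%}")
```

Under these assumed inputs the combined error falls inside the 10-20% range quoted above.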
An empirical estimate of sampling errors is the
difference between the lowest two concentrations in
the network, assuming that these concentrations result mainly from regional sources. While imperfect,
this estimate may be useful in large monitoring networks where the two lowest concentrations can be
considered replicates. In the case studies (described
later), this procedure gave relative errors of 15-20%. A
better, but rarely available measure, is the variance
between measurements obtained from colocated samplers.
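
A minimal sketch of the empirical estimate follows. The text defines the estimate as the difference between the two lowest concentrations; expressing it relative to the mean of those two values is an assumption made here for illustration, as are the function name and the sample values.

```python
import numpy as np

def two_lowest_error(concentrations):
    """Empirical sampling error from the two lowest network concentrations.

    Returns (absolute difference, difference relative to the mean of the
    two lowest values). The relative form is an assumption of this sketch.
    Missing observations are passed as NaN and ignored.
    """
    c = np.asarray(concentrations, dtype=float)
    c = np.sort(c[~np.isnan(c)])
    if c.size < 2:
        raise ValueError("need at least two valid observations")
    diff = c[1] - c[0]
    rel = diff / c[:2].mean()
    return diff, rel

# Hypothetical concentrations (ug/m3) at five sites for one sampling period.
diff, rel = two_lowest_error([22.0, np.nan, 18.0, 35.0, 41.0])
print(diff, rel)
```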
The three approaches yield relative errors in the
range of 10-46%. In most cases, error statistics are
not accurately known. Also, measurements obtained
under unusual conditions may yield much larger
errors. For example, erroneous particulate measurements can be caused by high loadings which clog
filters, unusual size distributions, and high wind
speeds which affect inlet performance.
$Z_t = [\, C_{1,t-m} \cdots C_{n,t-m} \mid \cdots \mid C_{1,t} \cdots C_{n,t} \mid \cdots \mid C_{1,t+m} \cdots C_{n,t+m} \,]'$,
(2)
(4)
(5)
(6)
where T is the number of observations used to estimate M and P. Matrix P contains information regarding the spatial and temporal correlation of the data.
(3)
Assuming unbiasedness ($E[V_t] = 0$) and uncorrelated
errors ($E[X_t V_t'] = 0$), the best linear, unbiased and
minimum variance estimate $\hat{X}_t$ of the missing observations is:
$\hat{X}_t = M + P (P + R_t)^{-1} (Z_t - M)$.    (7)
This Bayesian estimator weights the information provided by the observations (the so-called influence
vector $Z_t - M$) to yield the estimate $\hat{X}_t$. Results will be
identical to mean M if there is zero correlation between observations, i.e. P = 0. The (posterior) error of
estimation matrix S is:
$S_t = E[(X_t - \hat{X}_t)(X_t - \hat{X}_t)'] = P - P (P + R_t)^{-1} P$.    (8)
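
Equations (7) and (8) can be sketched in a few lines of NumPy. This is an illustration, not the author's implementation. Two things are assumptions of the sketch: the missing observation is represented by placing the mean in Z and assigning that element a very large error variance in R, and the numbers (two sites, variance 100, correlation 0.9, error variance 9) are made up for the example.

```python
import numpy as np

def optimal_estimate(M, P, R, Z):
    """Posterior estimate (Eq. 7) and error covariance (Eq. 8).

    M: mean vector, P: prior covariance, R: error covariance,
    Z: observation vector. P and R are assumed symmetric.
    """
    # K = P (P + R)^-1, computed with a linear solve instead of an
    # explicit inverse; the transpose is valid because P and R are symmetric.
    K = np.linalg.solve(P + R, P).T
    X_hat = M + K @ (Z - M)
    S = P - K @ P
    return X_hat, S

# Hypothetical two-site example: site 2 is missing.
M = np.array([30.0, 30.0])
P = np.array([[100.0, 90.0],
              [90.0, 100.0]])
LARGE = 1e8                      # assumed stand-in for "no information"
R = np.diag([9.0, LARGE])
Z = np.array([40.0, M[1]])       # placeholder value at the missing site

X_hat, S = optimal_estimate(M, P, R, Z)
print(X_hat)       # estimate at site 2 is pulled toward the site 1 observation
print(np.diag(S))  # posterior variances, smaller than the prior variances
```

With P set to zero the function returns M unchanged, which matches the remark above about zero correlation.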
(9)
At
X , . , = e x p ( X ,,t + S~/~2tw),
(10)
(11)
(12)
(13)
4. APPLICATION
Implementation and evaluation
Three case studies were used to evaluate the estimators. The first employed particulate data collected
in St Louis, MO from May to September 1976 as part of
the Regional Air Pollution Study (Strothmann and
Schiermeier, 1979). In this study, dichotomous samplers at 10 sites collected 12-h samples in fine and
coarse size fractions. Because of long gaps of missing
data, fewer sites are used here (Table 2). Most sites
were urban; coverage extended to about 45 km from
the city center. The second study is the Philadelphia
Area Field Study (Toothman, 1984) in which ambient
data were collected from 14 July to 13 August 1982 at
six sites. As in St Louis, dichotomous samplers collected 12-h particulate samples, also in two size fractions.
This urban area was considerably larger than St
Louis, yet the monitoring network was smaller. Most
sites were urban and industrial. The study included a
'special studies' site with impacts from a nearby oil
tank farm, local truck traffic and ongoing construction, and a rural site in New Jersey. The third case
study used O3 observations taken at 11 sites in
Houston, TX from April to September 1987. In this
study, monitoring sites ranged over a distance of
Table 2. Number of available and deleted data in case studies. Range is shown in parentheses

                              Number        Maximum
                              monitoring    possible    Available       Deleted
                              sites
St Louis
    Fine particles             9             92         62 (58-79)      23 (16-32)
    Coarse particles           7             92         62 (55-82)      22 (17-27)
Philadelphia
    Fine particles             6             62         54 (38-60)       7 (1-13)
    Coarse particles           6             62         49 (30-95)       5 (4-6)
Houston
    Ozone average             11            152        120 (97-147)      7 (6-8)
    Ozone peak                11            152        120 (9%147)      17 (9-22)
Average percentage of total                 100%        78%             14%
                              Lags = 0        Lags = 1        Lags = 2
                              r      b        r      b        r      b
Philadelphia
    Fine                      0.84   0.88     0.84   0.07     0.81   0.61
    Coarse                    0.84   2.11     0.82   1.00     0.82   1.23
St Louis
    Fine                      0.73   4.10     0.68   3.25     0.33   3.50
    Coarse                    0.66   4.74     0.67   5.65     0.76   3.68
Houston
    Peak                      0.90   0.36     0.86   0.39     0.88   0.06
    Average                   0.90   0.05     0.89   0.04     0.82   0.24
[Figure 1, panels (a)-(f): fine and coarse particles, Philadelphia study; coarse and fine particles, St Louis study; 12-h average ozone and hourly peak ozone, Houston study. Axes: original vs estimated concentration (µg/m3 for particles, pphm for ozone).]
Fig. 1. Scatterplots of predicted vs estimated data for the three case studies.
using site-specific results. Even without such calibrations, the predictions preserve the actual spatial and
temporal trends using solely the data's correlation
structure.
The largest relative errors result from overprediction of very low observations, e.g. particulate concen-
Extrema estimates
[Figure 2 plot. Axes: concentration (µg/m3) vs observation number (12 hour period). Series: observed, estimated, extreme 0%, extreme 20%.]
Fig. 2. Observations, estimates and extrema envelopes for fine fraction particulate
data in Philadelphia.
[Figure 3 plot. Axes: concentration (µg/m3) vs observation number (12 hour period). Series: observed, estimated, extreme 0%, extreme 20%.]
Fig. 3. Observations, estimates and extrema envelopes for coarse fraction particulate data in Philadelphia.
Computational demands
The computational demands of estimation depend
on m, the number of leads/lags, and n, the number of
sites. Covariance matrix P and mean vector M are
determined once for each data set. Interpolating
missing data for each sampling period requires inversion of a rank n(2m+1) matrix and several matrix
multiplications. About 2 s of computer time were
required for each sampling period using a fast
(33 MHz) 80386/7 microcomputer and a high accuracy LU decomposition inversion algorithm. Extrema
estimates require a single inversion for the entire data
set. Only simple matrix operations are needed for
each observation, thus these estimates are quickly
calculated.
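
The size of the system solved for each sampling period can be sketched as follows. The site count of 11 comes from the Houston case study; the choice m = 2 and the synthetic covariance are assumptions made only to have something to solve. `np.linalg.solve` is used in place of an explicit inverse, which is a design choice of this sketch and not necessarily the algorithm used in the study.

```python
import numpy as np

def system_order(n_sites, m_lags):
    """Order of the matrix inverted per sampling period: n(2m+1)."""
    return n_sites * (2 * m_lags + 1)

order = system_order(11, 2)          # 11 sites, assumed m = 2 leads/lags

# Synthetic symmetric positive-definite covariance, for illustration only.
rng = np.random.default_rng(0)
A = rng.standard_normal((order, order))
P = A @ A.T / order
R = 0.1 * np.eye(order)
rhs = rng.standard_normal(order)

# One linear solve per sampling period instead of forming (P + R)^-1.
w = np.linalg.solve(P + R, rhs)
print(order, np.allclose((P + R) @ w, rhs))
```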
5. DISCUSSION AND CONCLUSION
This paper has developed linear estimators to estimate data and extrema with the purposes of handling
missing data and accounting for errors. Extrema are
estimated from the full (estimated) data set. Estimators
of the lowest concentration in the network may represent regional levels if the monitoring system includes
sites which are largely unaffected by local sources.
Peak estimates, provided by the same estimator, indicate the contribution of local sources. Both estimates
should be more robust than observations from single
stations since spatial and temporal information from
all sites is utilized.
In application to three diverse data sets in St Louis,
Philadelphia and Houston, the estimators provided
reliable results. The high spatial and temporal correlation present in ambient pollutant levels at the urban
scale makes such estimators practicable. The estimators may be less appropriate for observations with
little correlation, e.g. some coarse fraction particulates, networks with intermittent sampling, or networks covering very large spatial scales. The Bayesian
estimator in Equation (7), which also can be viewed as
a linear contraction operator, tended to reduce the
scatter in the original data. In general, this is an
undesirable property. However, the original dispersion of the data can be restored by changing the degree
of post-whitening, altering the relative error, or by
using a different data transformation. Such network-specific calibrations may further increase the accuracy
of the estimators.
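
The contraction property can be seen in a scalar version of Equation (7). The following is a sketch under simplifying assumptions that are not part of the original derivation: a single observation $z = x + v$, with prior variance $p$ for $x$, error variance $r$ for $v$, mean $M$, and $x$ and $v$ uncorrelated.

```latex
\hat{x} = M + \frac{p}{p + r}\,(z - M),
\qquad
\operatorname{Var}(z) = p + r,
\qquad
\operatorname{Var}(\hat{x}) = \left(\frac{p}{p + r}\right)^{2} (p + r)
                            = \frac{p^{2}}{p + r} \le p .
```

Because the weight $p/(p+r)$ is at most one, the estimates vary less than both the observations and the underlying concentrations, which is the reduced scatter noted above. Lowering the assumed relative error $r$ moves the weight toward one and restores dispersion.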
The estimators view historical observations as imperfect (error-containing) random variables, a fundamentally different perspective than the usual assumptions that the observations are representative and
error-free. Spatial and temporal information has been
used in the opposite manner to select sites in the
optimal design of air monitoring networks (e.g.
Shindo et al., 1990). Results obtained in the case
studies imply that monitoring observations, to varying degrees, are redundant in providing site-specific
information since observations at some sites can be
used to predict concentrations at other sites.
The behavior of the estimator depends on the
relative strengths of the temporal and spatial correlation. If temporal correlation is dominant, a missing
observation both preceded and followed by valid
observations at the same site is estimated using primarily a weighted sum of leading and lagging observations at that site. If leading and lagging observations
are missing, the estimate is derived from simultaneous
observations at other sites. If spatial correlation is
dominant, results depend on simultaneous measurements taken at other sites and to a lesser extent on
leading and lagging observations. If many simultaneous measurements are missing, leading and lagging
observations and the constant (mean) are emphasized.
In each case, weights given to leading and lagging
observations can be significant, and the coefficients
are site-specific and depend on the data available. In
comparison with the estimators for missing values,
extrema estimates primarily depend on simultaneous
observations. Thus, these estimators might be simplified to use observations at only the current time. The
estimator automatically determines the weightings so
as to minimize the variance of the estimate. The
estimation procedure is flexible and applicable to
other types of data.
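
The dependence of the weights on the correlation structure can be illustrated with a small synthetic case. Everything in this sketch is assumed for illustration: two sites, one lead and one lag (m = 1), a separable covariance built as a Kronecker product, temporal correlation decaying as a power of the lag, unit variances, and a large error variance standing in for the missing observation.

```python
import numpy as np

def weights_for_missing(rho_time, rho_space, noise_var=0.05, big=1e8):
    """Weights Eq. (7) assigns when estimating site 1 at time t.

    Vector ordering: [s1,t-1  s2,t-1  s1,t  s2,t  s1,t+1  s2,t+1].
    The separable covariance and the numeric values are assumptions.
    """
    lags = np.arange(3)
    T = rho_time ** np.abs(lags[:, None] - lags[None, :])   # temporal part
    Sp = np.array([[1.0, rho_space], [rho_space, 1.0]])     # spatial part
    P = np.kron(T, Sp)
    r = np.full(6, noise_var)
    missing = 2                    # site 1 at time t
    r[missing] = big               # missing value carries no information
    K = np.linalg.solve(P + np.diag(r), P).T   # K = P (P + R)^-1
    return K[missing]

w_temporal = weights_for_missing(rho_time=0.9, rho_space=0.2)
w_spatial = weights_for_missing(rho_time=0.2, rho_space=0.9)
print(np.round(w_temporal, 3))
print(np.round(w_spatial, 3))
```

When temporal correlation dominates, the largest weights fall on the leading and lagging observations at the same site; when spatial correlation dominates, the simultaneous observation at the other site carries most of the weight.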
Several refinements to the estimation procedures
are possible. Although not attempted here, additional
variables could be used to augment the pollutant
variables and improve performance. For example,
ambient temperature could be used to help predict O3
concentrations. More accurate estimators might disaggregate by season, wind direction, or other features
REFERENCES