www.elsevier.com/locate/envpol
“Capsule”: The adoption of regression models for heavy metals could provide a substantial improvement in risk
assessment practices.
Abstract
Risk assessment studies apply fate and transport models to predict the behaviour of chemicals in the environment. The
definition of physico-chemical properties is crucial to predict the mobility of pollutants, and of heavy metals in particular,
within the environmental compartments. The conservative approach normally adopted at the screening level when assigning a
value to the soil–water distribution coefficient (Kd) results in an extremely variable predicted mobility in soil. In this
paper, a regression model for rapidly estimating the Kd of heavy metals is proposed and applied to Pb, allowing a considerable
reduction (3–4 orders of magnitude) of the estimation uncertainty. The application of a stepwise forward multiple regression
to literature data provided a pH-dependent regression equation of the Kd for Pb: log Kd = 1.99 + 0.42 pH.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Risk assessment; Heavy metals; Regression model; Partitioning
particularly during the screening level of the analysis, in order to save money and time. In most cases it
implies the selection of the lowest Kd values (i.e. high mobility) reported in literature. It can be
argued that the Kd estimation should take into account soil properties, wherever known, in order to
derive values that are realistic rather than merely conservative. However, suitable algorithms to derive
the Kd from soil properties in risk assessment studies are not yet available.

A considerable number of studies have tried to explain the mobility of heavy metals in soils based on
the multiple regression analysis of extractable metal concentration (the dependent variable) versus soil
properties such as total metal content, pH, organic matter content (OM), cation exchange capacity
(CEC), and the presence of Fe and Al oxides (Feox and Alox) (e.g., John, 1972; Kuo et al., 1985;
Bogacz, 1994; He and Singh, 1993; Hornburg and Brummer, 1993; Jopony and Young, 1994; McBride et al.,
1997; Sauvé et al., 1997, 1998). Other authors have used different dependent variables, such as sorption
coefficients (e.g., Harter, 1979; Anderson and Christensen, 1988; Kuo and Jellum, 1991; Janssen et al.,
1997) and soil migration rate (Dragun, 1988). A number of authors have considered the fraction of
extractable metal against the total soil metal as the dependent variable (Hornburg and Brummer, 1993;
Rieuwerts et al., 1998). Most of these multivariate regression exercises aim to infer the relationships
between metal solubility and soil properties, while some studies aim to develop predictive metal
solubility algorithms. Some authors (Sauvé et al., 1997; Janssen et al., 1997; Jopony and Young, 1994;
McBride et al., 1997) considered a large number of soil samples covering a broad range of different soil
types, pollution sources and pollutant concentrations, in order to obtain more general results. However,
these algorithms were rarely compared to each other and their validity was limited to specific case
studies. Differences between algorithms may derive from differences in the experimental procedures used
to measure the metal solubility as well as from different sorption models and regression methods used to
relate solubility to soil properties.

In this paper, a method to provide suitable Kd values in risk assessment studies based on literature
data is proposed and an example for Pb Kd is provided. Literature datasets were collected and validated
for this purpose. The correlations between Kd and soil properties were explored for each dataset by a
multivariate regression exercise. Different data sets were combined to create a master data set, which
was used to develop a multivariate regression prediction model of Pb Kd depending on soil properties.
The guiding principle was that although the combination of different datasets may decrease the
regression, it normally increases the representativity of the regression model.

2. Methods

2.1. Multivariate regression method

Kd and Pbdis were correlated to the soil characteristics by a stepwise forward regression analysis based
on the linear least squares method. This consisted of calculating the linear regression so as to minimise
the squared deviations of the observed points from the predicted points (the squared residuals). The
stepwise forward procedure consisted of including the independent variables in the model one by one,
setting the condition that the F value, calculated as the ratio of the regression mean square to the
residual mean square, was higher than 1. This method allows the inclusion in the regression model of only
the most significant variables.

The goodness-of-fit of the multiple regression model was evaluated by the adjusted R2-value (R2-adj),
which is the coefficient of determination adjusted for the number of added predictor variables, the
F-value, and the Standard Deviation Error of the Estimate (SDEE), which describes the dispersion of the
observed values around the regression line. The level of significance of the regression, the intercept
and the independent variable coefficients was evaluated on the basis of the respective calculated
p-values. An analysis of residuals was carried out to test the goodness of fit and the homoscedasticity
of the equation (Einax et al., 1997).

The predictive capability of the regression model was determined using the training/validation splitting
method, which consists of splitting the dataset into a training set, used for calculating the regression
model, and a validation set, used for validating the model (Massart et al., 1996).

The predictive performance of the regression model was estimated using the predicted or cross-validated
variance, expressed as Q2, and the Standard Deviation Error of Prediction (SDEP) (Tosato et al., 1992;
Piazza et al., 1995):

$$\mathrm{SDEP} = \left[ \frac{\sum_i \left( y_{\mathrm{obs},i} - y_{\mathrm{pre},i} \right)^2}{n} \right]^{0.5},$$

where yobs are observed values external to the dataset used for the model construction, ypre are the
model predictions, and n is the number of external observations.

The multiple stepwise forward regression was performed by using the Statistica 4.1 (StatSoft Inc., 1992)
program, while the calculation of Q2 and SDEP was performed by using the Excel 97 data management
program (Microsoft, 1997).
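The stepwise forward procedure and the fit statistics of Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the Statistica routine used in the paper. It assumes that the "F higher than 1" condition is applied as a partial F-to-enter test on each candidate variable, and that SDEE is the square root of the residual mean square. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, names, f_to_enter=1.0):
    """Add predictors one by one while the best candidate's partial F exceeds f_to_enter."""
    n, k = X.shape
    selected = []
    rss_current = float(np.sum((y - y.mean()) ** 2))  # intercept-only model
    while len(selected) < k:
        best = None
        for j in range(k):
            if j in selected:
                continue
            rss_new = ols_rss(X[:, selected + [j]], y)
            df_resid = n - (len(selected) + 1) - 1
            # Assumed criterion: partial F of the candidate variable
            f_partial = (rss_current - rss_new) / max(rss_new / df_resid, 1e-12)
            if best is None or f_partial > best[1]:
                best = (j, f_partial, rss_new)
        if best[1] <= f_to_enter:
            break
        selected.append(best[0])
        rss_current = best[2]
    return [names[j] for j in selected], rss_current

def fit_statistics(rss, y, n_predictors):
    """Adjusted R2 and standard deviation error of the estimate (SDEE)."""
    n = len(y)
    tss = float(np.sum((y - y.mean()) ** 2))
    df_resid = n - n_predictors - 1
    r2_adj = 1.0 - (rss / df_resid) / (tss / (n - 1))
    sdee = float(np.sqrt(rss / df_resid))
    return r2_adj, sdee

# Synthetic illustration only (not the paper's data)
rng = np.random.default_rng(0)
n = 200
ph = rng.uniform(3.5, 8.5, n)
log_pbtot = rng.uniform(0.5, 5.0, n)
log_kd = 2.0 + 0.4 * ph + rng.normal(0.0, 0.3, n)
X = np.column_stack([ph, log_pbtot])
selected, rss = forward_stepwise(X, log_kd, ["pH", "log_Pbtot"])
r2_adj, sdee = fit_statistics(rss, log_kd, len(selected))
print(selected, round(r2_adj, 2), round(sdee, 2))
```

With a threshold as low as 1, a variable unrelated to the response still has a fair chance of entering the model, so the p-values of the individual coefficients remain the main guide to significance.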
C. Carlon et al. / Environmental Pollution 127 (2004) 109–115 111
3. Results and discussion

3.1. Data selection

An extensive literature survey was undertaken to find studies containing suitable data. Five data sets
were finally selected, reported by Jopony (1991), Sauvé et al. (1997), Janssen et al. (1997), Aten and
Gupta (1996), and Gerritse and Van Driel (1984). The mean, median, maximum, minimum and standard
deviation of individual data for each data set are presented in Table 1.

Since all the parameters of each dataset showed a log-normal type distribution, the data used in this
study were log-transformed (except pH data) before multiple regression was carried out, in order to meet
the assumption of normality required by the regression method. Even when the selected dataset had already
been used in multivariate regressions to relate Pbdis or Kd to soil properties (i.e., Sauvé et al., 1997,
for Pbdis, and Janssen et al., 1997, for Kd), a new regression was performed to obtain and compare the
same statistical parameters for each dataset.

The correlation matrix of each dataset showed that the correlation coefficients between the independent
variables were negligible, i.e. lower than 0.05, so that they could all be included in the multiple
regression exercise.

The combined dataset is assumed to be representative of a wide variety of soil types and experimental
conditions. Since the parameters OM, Alox and PO₄³⁻ were not present in all datasets, only the parameters
Kd, Pbtot and pH were retained in the combined dataset. A plot of log Pbtot against pH of the Master
dataset shows a homogeneous distribution of the predictive variables covering a pH range from ca. 3.5 to
ca. 8.5 and a log Pbtot range from ca. 0.5 to ca. 5 (Pbtot in mg/kg d.w.) (Fig. 1).

Table 1
Mean, median, minimum, maximum and standard deviation of the selected data sets
Authors and no. of samples Chemical parameters Mean Median Minimum Maximum St. Dev.

Fig. 1. A log Pbtot vs pH bivariate plot showing the range of variation of the Master data obtained combining the data from six datasets.

3.2. Log Kd and log Pbdis vs soil properties

For each dataset, a stepwise forward regression procedure was applied separately to the log-transformed
values of Pbdis and Kd (log Kd or log Pbdis = a + b pH + c log Pbtot + d log Alox + e log PO₄³⁻ + f log OM).
The results are reported in Tables 2 and 3 for log Pbdis and log Kd, respectively, with the independent
variable coefficients, intercepts, respective standard errors, adjusted R2, F value, p-level and standard
deviation errors of the estimates. Janssen et al. (1997) used the multiple regression method to relate
the Kd to the soil characteristics and reported results very similar to those recalculated in this study.
However, since the data on Fe oxyhydroxides (log Feox) were not available, the relation between log Kd
and log Feox could not be determined, whereas a significant correlation between log Kd and log Alox was
found. While Sauvé et al. (1997) applied the multiple regression technique to the raw Pbdis data, in this
study the data were log-transformed in order to meet the assumption of normal distribution.

The results showed that pH was a significant predictive parameter in all the regression equations for
log Pbdis and log Kd. Log Pbtot was also a significant predictive parameter in all the regression
equations for log Pbdis but only in three out of the five equations for log Kd. Since Kd is
[Pbtot]/[Pbdis], a positive correlation between log Kd and log Pbtot indicates that Pbdis does not
increase proportionally with the increase in Pbtot.

Amorphous oxyhydroxides (Alox) in soil and dissolved phosphate in pore water (molybdo-PO4) proved to be
significant predictors of both log Pbdis and log Kd for the Janssen and Sauvé datasets, respectively.
However, Alox in soil and phosphate in soil solution are not widely measured in soil investigations and
do not provide a relevant increase of the explained variance (R2).

On the basis of the Sauvé, Janssen and Aten & Gupta datasets, log OM and log CEC in soil are not
significant predictors of either log Pbdis or log Kd. However, log OM was highly positively correlated to
log Pbdis and log Kd in the Gerritse and Van Driel dataset.

This apparent discrepancy can be ascribed to the difficulty in treating the effect of organic matter and
CEC by a simple correlation analysis because: (i) pH is a controlling variable on metal complexation by
organic matter; (ii) the low variability of organic matter content can make the tests of statistical
significance misleading; (iii) treating organic matter content as a single variable ignores the
heterogeneous nature of soil organic matter. The relationship between OM and pH is also evident in the
Gerritse and Van Driel regression exercise, since the log OM significance decreases after the addition of
pH and log Pbtot to the independent variables.

In conclusion, pH and log Pbtot were found to be the most suitable parameters for predicting log Pbdis
and log Kd. Considering the pH and log Pbtot dependent regressions for both log Pbdis and log Kd, the
regressions of the Jopony dataset (Jo.2 and Jo-k.2 for log Pbdis and log Kd respectively, Tables 2 and 3)
exhibited the best fit, i.e. the highest fraction of explained variance, and the lowest standard
deviation errors of estimate, coefficients and intercept. For both the log Pbdis algorithms and the log
Kd algorithms, it can be seen that the pH coefficients obtained from different datasets are very similar.
On the basis of these results, log Kd depends on pH with a regression coefficient ranging from 0.34 to
0.47. The standard errors of the intercepts are generally higher than those of the independent variable
coefficients. Plots of the predicted versus the observed values were examined for each equation, with a
linear regression line fitted to the data and its 95% confidence interval. None of the scatterplots
revealed relevant outliers or heteroscedasticity effects.
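The link between the log Pbdis and log Kd regressions discussed in Section 3.2 follows directly from the definition of Kd. The short derivation below is a sketch under the assumption that Pbtot and Pbdis are expressed in mutually consistent units, so that a unit conversion would only shift the intercept. The symbols a, b and c are those of the regression equation used in the text.

```latex
\begin{align*}
K_d &= \frac{[\mathrm{Pb_{tot}}]}{[\mathrm{Pb_{dis}}]}
\;\Longrightarrow\;
\log K_d = \log \mathrm{Pb_{tot}} - \log \mathrm{Pb_{dis}}, \\
\log \mathrm{Pb_{dis}} &= a + b\,\mathrm{pH} + c\,\log \mathrm{Pb_{tot}}
\;\Longrightarrow\;
\log K_d = -a - b\,\mathrm{pH} + (1 - c)\,\log \mathrm{Pb_{tot}}.
\end{align*}
```

A positive log Pbtot coefficient in the log Kd regression therefore corresponds to c < 1, i.e. Pbdis increasing less than proportionally with Pbtot. If the last coefficient listed for Eq. Sa.2 (0.51, Table 2) and for Eq. K-Sa.2 (0.49, Table 3) is read as the log Pbtot coefficient, which is an interpretation of the flattened table rows, the two values sum to 1 as the derivation predicts.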
Table 2
Results of the multiple stepwise forward regression log Pbdis = a + b pH + c log Pbtot + d log Alox + e log PO₄³⁻ + f log OM, where Pbdis is the total dissolved Pb (mg/L), Pbtot (mg/kg d.w.) is the total Pb content in soil (after HNO3 digestion), Alox (mg/kg) is the amount of aluminium extracted by ammonium oxalate/oxalic acid, OM is the organic matter content (g/kg) determined after dichromate oxidation, and PO₄³⁻ (mM of P) is molybdo-reactive phosphate in soil solution
Jopony (1991) n=100 Step 1 Eq. Jo.1 0.54 118 Est. 0.36 0.5 0.2 (P=0.019) 0.65 0.06 (P=1E-18)
Variables=Pbdis, Pbtot, pH Step 2 Eq. Jo.2 0.91 49 Est. 0.16 1.2 0.1 (P=6E-15) 0.47 0.02 (P=7E-36) 1 0.03 (P < 1E-18)
Sauvé et al. (1997) n=84 Step 1 Eq. Sa.1 0.075 8 Est. 0.50 0.06 0.21 (P=0.8) 0.23 0.08 (P=0.007)
Variables=Pbdis, Pbtot, pH, OM, PO4 Step 2 Eq. Sa.2 0.32 20 Est. 0.43 1.1 0.3 (P=8E-05) 0.26 0.05 (P=5E-07) 0.51 0.09 (P=1E-07)
Step 3 Eq. Sa.3 0.38 18 Est. 0.41 1.6 0.3 (P=8E-07) 0.30 0.05 (P=9E-09) 0.54 0.08 (P=1E-08) PO₄³⁻ 0.01 0.004 (P=0.003)
Table 3
Results of the multiple stepwise forward regression log Kd = a + b pH + c log Pbtot + d log Alox + e log molybdo-PO4 + f log OM, where Kd (L/kg) is the soil–water Pb distribution coefficient, Pbtot (mg/kg d.w.) is the total Pb content in soil (HNO3 digestion), Alox (mmol/kg) is the amount of aluminium extracted by ammonium oxalate/oxalic acid, OM is the organic matter (% of dry matter) determined after dichromate oxidation, and PO₄³⁻ (mM of P) is molybdo-reactive orthophosphate in soil solution
Jopony (1991) n=100; Step 1 Eq. K-Jo.1 0.85 564 Est. 0.16 1.8 0.1 (P=7E-25) 0.47 0.02 (P=2E-42)
Variab.=Pbdis, Pbtot, pH
Sauvé et al. (1997) n=84 Step 1 Eq. K-Sa.1 0.50 85 Est. 0.50 2.9 0.2 (P=1E-23) 0.76 0.08 (P=3E-14)
Variab.=Pbdis, Pbtot, pH, OM, molybdo-PO4 Step 2 Eq. K-Sa.2 0.63 72 Est. 0.43 1.9 0.3 (P=1E-10) 0.26 0.05 (P=5E-07) 0.49 0.09 (P=3E-07) Molybdo-PO4 -0.01 0.004 (P=0.003)
Step 3 Eq. K-Sa.3 0.67 56 Est. 0.41 1.4 0.3 (P=4E-06) 0.30 0.05 (P=9E-09) 0.46 0.08 (P=4E-07)
Janssen et al. (1997) Step 1 Eq. K-Ja.1 0.37 35 Est. 0.37 2.1 0.3 (P=1E-05) 0.35 0.06 (P=1E-05) Log Alox 0.4 0.3 (P=0.21)
n=19 Variab.=Pbdis, Pbtot, pH, Alox. Step 2 Eq. K-Ja.2 0.66 19 Est. 0.36 1.8 0.4 (P=0.0005) 0.31 0.06 (P=1E-04)
Gerritse and Van Driel (1984) Step 1 Eq. K-Ge.1 0.51 32 Est. 0.51 3.7 0.3 (P=5E-14) 0.7 0.1 (P=4E-06)
n=31 Variab.=Pbdis, Pbtot, pH, OM. Step 2 Eq. K-Ge.2 0.64 27 Est. 0.44 2.1 0.5 (P=0.0002) 0.25 0.07 (P=0.002) 0.7 0.1 (P=3E-06)
Aten and Gupta (1996) Step 1 Eq. K-At.1 0.77 41 Est. 0.33 2 1 (P=0.05) 1.0 0.2 (P=5E-05)
n=13 Variab.=Pbdis, Pbtot, pH, CEC. Step 2 Eq. K-At.2 0.91 64 Est. 0.20 1.8 0.7 (P=0.02) 0.7 0.1 (P=8E-05) 0.7 0.2 (P=0.001)
3.3. A predictive model for estimating the Pb soil–water coefficient (Kd) in soil

The combined dataset, named Master, was composed of the values of log Kd, pH, and log Pbtot from each
dataset, whereas other variables, such as OM, Alox and CEC, were not included because they were not
present in all the original datasets. However, the regression exercise applied to each original dataset
showed that the significance of these latter variables for predicting log Kd is low.

The combined dataset was divided into two sub-datasets, named Master 1 and Master 2, by ranking the cases
in order of increasing log Kd and assigning alternate cases to each sub-dataset. The Master 1 and Master 2
datasets were equally representative of the original dataset, as they covered the whole range of log Kd
variation and were homogeneously distributed across the range of variation of log Pbtot and pH, as shown
by the log Pbtot vs pH plot in Fig. 2.

The Master 1 dataset was used as a training dataset for calculating a regression model, whose prediction
capability was tested using the Master 2 dataset. Then, the reciprocal validation was performed, using
Master 2 as the training set and Master 1 as the validation set, in order to verify the stability of the
model.

The results of the multivariate regression of log Kd against log Pbtot and pH for the Master, Master 1
and Master 2 datasets are reported in Table 4. It can be noticed that log Pbtot gives a very small
contribution to the log Kd prediction on the basis of the combined dataset, and appears not to be a
significant predictor for the Master 1 and Master 2 datasets. As a result, log Kd could be explained as a
univariate function of pH.

The predictive capability of the Master 1 and Master 2 based models was expressed in terms of Q2 and SDEP
(Table 4). The predicted variance Q2 is rather low and the Standard Deviation Error of Prediction is
quite high. However, the similarity of the Q2 and SDEP values for the Master 1 and Master 2 models
indicates that the model is stable. According to the training/validation splitting method (Massart et
al., 1996), the Master 1 model, which showed the best predictive performance, was selected as
representative of the Master dataset.

A visual presentation of the predictive capability of the Master 1 model against the Master 2 dataset is
given by a plot of predicted vs external validation data (Fig. 3), along with the training set values.
The dashed lines around the 1:1 line represent the 95% upper and lower confidence limits calculated from
the Standard Deviation Error of Prediction (SDEP).

Since the distribution of residuals was shown to be normal, ±2 SDEP can be regarded as the 95% confidence
limits of the data population. The scattered distribution of data around the 1:1 line indicates the low
predictability of log Kd values. However, an accurate prediction of log Kd is beyond the scope of this
study, and the lower 95% confidence limit of the prediction can be regarded as a conservative estimate of
log Kd. It is remarkable that this model allows the reduction of the range of uncertainty of the Kd
estimation from four to one order of magnitude simply on the basis of the soil pH measurement.
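As a worked illustration of how the pH-based model and its lower confidence limit might be used at the screening level, the sketch below evaluates the equation quoted in the abstract (log Kd = 1.99 + 0.42 pH) and subtracts 2 SDEP. It assumes that the SDEP of 0.49 listed in Table 4 for the Master 1 model applies to that equation, which the text does not state explicitly. The function names are hypothetical.

```python
def log_kd_from_ph(ph, intercept=1.99, slope=0.42):
    """Central estimate of log Kd (Kd in L/kg) from soil pH, using the abstract's equation."""
    return intercept + slope * ph

def conservative_log_kd(ph, sdep=0.49):
    """Lower ~95% limit of log Kd, taken as the prediction minus 2 * SDEP (assumed pairing)."""
    return log_kd_from_ph(ph) - 2.0 * sdep

soil_ph = 6.0
central = log_kd_from_ph(soil_ph)       # 1.99 + 0.42 * 6.0
lower = conservative_log_kd(soil_ph)    # central estimate minus 0.98
print(f"pH {soil_ph}: log Kd = {central:.2f}, conservative log Kd = {lower:.2f}")
print(f"Kd range: {10 ** lower:.0f} to {10 ** central:.0f} L/kg")
```

The interval between the upper and lower limits spans 4 SDEP, about two log units with these numbers, so the one order of magnitude quoted in the text corresponds to the distance from the central estimate to the lower limit.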
Fig. 2. Plot of Master 1 and Master 2 data according to log Pbtot and pH.

Table 4
Predictive capability of Master 1 and Master 2 based regression models calculated by reciprocal external
validation: PRESS values of Master 1 model vs Master 2 data and Master 2 model vs Master 1 data
                                    Q2      SDEP
Master 1 model vs Master 2 data     0.47    0.49
Master 2 model vs Master 1 data     0.37    0.52

Fig. 3. Plot of both the Pb log Kd values predicted by using the Master 1 regression model and the Pb log
Kd training set values against the observed data of the validation Master 2 dataset.
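The reciprocal external validation summarised in Table 4 can be sketched as follows. This is a minimal illustration on synthetic data, not a reproduction of the paper's calculation. It assumes Q2 = 1 - PRESS / SS, with SS taken around the mean of the validation observations, since the text does not spell out the Q2 formula. SDEP follows the definition given in Section 2.1. All function names are hypothetical.

```python
import numpy as np

def alternate_split(y):
    """Rank cases by increasing y and assign alternate cases to two subsets."""
    order = np.argsort(y)
    return order[0::2], order[1::2]

def sdep(y_obs, y_pre):
    """Standard Deviation Error of Prediction: sqrt(sum((obs - pre)^2) / n)."""
    return float(np.sqrt(np.mean((y_obs - y_pre) ** 2)))

def q2(y_obs, y_pre):
    """Predicted variance, assumed here as 1 - PRESS / SS around the validation mean."""
    press = np.sum((y_obs - y_pre) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return float(1.0 - press / ss_tot)

def external_validation(x, y, train, valid):
    """Fit y = intercept + slope * x on the training cases, score on the validation cases."""
    slope, intercept = np.polyfit(x[train], y[train], 1)
    y_pre = intercept + slope * x[valid]
    return q2(y[valid], y_pre), sdep(y[valid], y_pre)

# Synthetic illustration only (not the Master dataset)
rng = np.random.default_rng(1)
ph = rng.uniform(3.5, 8.5, 120)
log_kd = 1.99 + 0.42 * ph + rng.normal(0.0, 0.5, 120)

master1, master2 = alternate_split(log_kd)
q2_12, sdep_12 = external_validation(ph, log_kd, master1, master2)  # model 1 vs data 2
q2_21, sdep_21 = external_validation(ph, log_kd, master2, master1)  # model 2 vs data 1
print(round(q2_12, 2), round(sdep_12, 2), round(q2_21, 2), round(sdep_21, 2))
```

Similar Q2 and SDEP values in the two directions are what the text interprets as evidence of model stability.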