
Uncertainty with the Gamma Test for Model Input Data Selection

Dawei Han, Weizhong Yan and Alireza Moghaddam Nia

Abstract - The Gamma Test has attracted the attention of many researchers in the nonlinear modeling field, especially with Artificial Neural Networks. In theory, the test should provide a modeler with valuable information to find the best input variables without extensive model development for each potential input combination. However, it has been found that the Gamma Test does not always point to the best input combination as validated by the cross validation method. This paper presents a study of using the generalized regression neural network (GRNN) to estimate evaporation. Both the Gamma Test and cross validation are used to find the best model input combination. It has been found that the Gamma Test is not able to identify the best input variables, but the best result is included in the top Gamma value group. The standard error carries very valuable information for choosing the group members. This demonstrates that the Gamma Test is still a valuable tool in significantly reducing the modeling workload. The reason for this phenomenon is discussed with respect to the relationship between the Gamma estimate and its standard error. Further research is still needed to explore this relationship for more efficient model input selection.

I. INTRODUCTION

Nonlinear model development is much more complicated than that of linear models. Conventional tools such as cross correlation and Principal Component Analysis (PCA) are usually not suitable for nonlinear systems. Natural systems, such as hydrological processes, are usually complex and nonlinear. A hydrological modeller needs to use a trial and error method to build mathematical models (such as ANN, Artificial Neural Networks) for different input combinations [1-3]. This is very time consuming since the modeller needs to calibrate and test different model structures with all the likely input combinations. In addition, there is no guidance about what accuracy a best model is able to achieve. In this study, the Gamma Test [4] developed by computer scientists at Cardiff University is explored for its suitability in reducing model development workload and providing input data guidance before specific models are developed (i.e., its result is independent of the models to be developed). Theoretically, the Gamma Test is able to provide the best mean square error that can possibly be achieved using any nonlinear smooth model. Although the Gamma Test has been applied by some researchers in identifying data mining models (ANN, LLR - Local Linear Regression, etc.) [5, 6], few validations of the effectiveness of the Gamma Test have been done. Those studies usually tried the Gamma Test and used the results to guide their subsequent model development. A recent study by Han and Yan [16] found that the Gamma Test did not point out the best input combination. Further exploration of the uncertainty of the Gamma Test is needed.

Manuscript received February 4, 2010.
Dawei Han is with the Department of Civil Engineering, University of Bristol, BS8 1TR, UK (phone: +44 117 3315 739; fax: +44 117 3315 719; email: d.han@bristol.ac.uk)
Weizhong Yan is with GE Research, USA (email: yan@crd.ge.com)
Alireza Moghaddam Nia is with University of Zabol, Iran (email: ali.moghaddamnia@gmail.com)

978-1-4244-8126-2/10/$26.00 ©2010 IEEE

In this study, the validation of the Gamma Test is carried out through a case study based on the evaporation data in the
Sistan region of Iran. The data set contains 11 years of wind,
temperature, saturation vapour pressure deficit, relative
humidity and pan evaporation. Different combinations of
input data were explored to assess their influence on the
evaporation estimation modelling. The nonlinear dynamic
model tested is the generalized regression neural network
(GRNN), a special type of neural network. The training and
testing data are partitioned by random selection from the
original data set. The uncertainty with the Gamma Test is
explored in the relationship between the Gamma value and
its standard error.
II. STUDY AREA
The study area is the Sistan plain located in the Southeast
of Iran, one of the driest regions in the country and famous
for its "120 day wind", a highly persistent dust storm in the
summer which blows from the north and northwest to the
south and southeast with velocities of nearly 20 knots.
The Hirmand River, which originates in Afghanistan, bifurcates into two branches, namely the Parian and the Sistan, when it reaches the Iranian border. The Hirmand River flows through the Sistan
plain and discharges into the natural swamp of Hamun-e
Hirmand (Figure 1). As can be seen in the figure, the Sistan
plain is essentially an inland delta with its major
watercourses leading to a series of lakes. The Sistan delta
has a very hot and dry climate and its climate is classified as
BWh according to the Köppen Climate Classification (i.e.,
main climate B: arid; precipitation W: desert; temperature h:
hot) [7]. Strong winds in the region are quite unique and are
an important contributing factor for the high evaporation.

The daily weather variables are measured by an automated weather station at Chahnime of Zabol (longitude 61°40' - 61°49' E, latitude 30°45' - 30°50' N) operated by the IR Sistan and Balochastan Regional Water (IR SBRW, http://www.sbrw.ir). The data set consisted of eleven years (1995-2006) of daily records of air temperature (T), wind speed (W), saturation vapor pressure deficit (Ed), relative humidity (RH) and pan evaporation (E). The data are randomized before being split into training and testing sets (with a ratio of 2 to 1). The randomization removes the impact of any local data clusters and makes the training and testing data sets similar.

Fig. 1. The Sistan Region and the Chahnime reservoirs

III. GAMMA TEST

The Gamma Test estimates the minimum mean square error (MSE) that can be achieved when modeling the unseen data with any continuous nonlinear model. The Gamma Test was first reported by Stefánsson et al. in 1997 [4], and later enhanced and discussed in detail by many researchers [8-12]. Only a brief introduction to the Gamma Test is given here; interested readers should consult the aforementioned papers for further details.

Suppose we have a set of data observations

{(x_i, y_i), 1 ≤ i ≤ M}    (1)

where the input vectors x_i ∈ R^m are confined to some closed bounded set C ⊂ R^m and the corresponding outputs y_i ∈ R are scalars. The underlying relationship of the system is of the following form:

y = f(x_1, ..., x_m) + r    (2)

where f is a smooth function and r is a random variable that represents noise. It can be assumed that the mean of r's distribution is zero (since any constant bias can be subsumed into the unknown function f) and that the variance of the noise, Var(r), is bounded. The domain of a possible model is now restricted to the class of smooth functions which have bounded first partial derivatives. The Gamma statistic Γ is an estimate of that part of the output variance that cannot be accounted for by a smooth data model.

The Gamma Test is based on the nearest neighbours x_N[i,k] (1 ≤ k ≤ p) of each input vector x_i (1 ≤ i ≤ M), where N[i,k] is the index of the k-th nearest neighbour of x_i and p is the maximum number of near neighbours to be included. Specifically, the Gamma Test is derived from the Delta function of the input vectors:

δ_M(k) = (1/M) Σ_{i=1}^{M} |x_N[i,k] − x_i|²    (1 ≤ k ≤ p)    (3)

where |·| denotes the Euclidean distance, and the corresponding Gamma function of the output values is:

γ_M(k) = (1/(2M)) Σ_{i=1}^{M} |y_N[i,k] − y_i|²    (1 ≤ k ≤ p)    (4)

where y_N[i,k] is the output value corresponding to the k-th nearest neighbour of x_i in Eq. (3). In order to compute Γ, a least squares regression line is constructed for the p points (δ_M(k), γ_M(k)):

γ = A δ + Γ    (5)

The intercept on the vertical axis (δ = 0) is the Γ value, since

γ_M(k) → Var(r) in probability as δ_M(k) → 0    (6)

Calculating the regression line gradient can also provide helpful information on the complexity of the system under investigation (a steeper gradient indicates a model of greater complexity). A formal mathematical justification of the method can be found in Evans and Jones [14]. We can standardize the result by considering the term V_ratio, which returns a scale-invariant noise estimate between zero and one:

V_ratio = Γ / σ²(y)    (7)

where σ²(y) is the variance of the output y. This allows a judgment to be formed, independently of the output range, as to how well the output can be modeled by a smooth function. A V_ratio close to zero indicates a high degree of predictability of the given output y. The Gamma Test result helps avoid overfitting a model beyond the stage where the MSE on the training data is smaller than Var(r), and helps decide the required data length to build a meaningful model. The Gamma Test analysis can be performed using the winGamma software implementation [12].
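The procedure of Eqs. (3)-(7) can be sketched in a few lines of Python. This is a minimal illustration only, not the winGamma implementation; the function name `gamma_test` and the brute-force neighbour search are our own choices for clarity.

```python
import numpy as np

def gamma_test(X, y, p=10):
    """Estimate the Gamma statistic (intercept), gradient A and V_ratio
    from the regression line gamma = A*delta + Gamma (Eqs. 3-7)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    M = len(y)
    # Squared Euclidean distances between all pairs of input vectors
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)             # a point is not its own neighbour
    order = np.argsort(d2, axis=1)[:, :p]    # indices N[i, k], k = 1..p
    delta = np.empty(p)
    gamma = np.empty(p)
    for k in range(p):
        nn = order[:, k]                               # k-th nearest neighbour of each x_i
        delta[k] = d2[np.arange(M), nn].mean()         # Eq. (3)
        gamma[k] = ((y[nn] - y) ** 2).mean() / 2.0     # Eq. (4)
    A, G = np.polyfit(delta, gamma, 1)       # least-squares line, Eq. (5)
    v_ratio = G / y.var()                    # Eq. (7)
    return G, A, v_ratio
```

On noisy samples of a smooth function, the intercept G should approach the noise variance Var(r) as the number of samples M grows, which is the property the test relies on.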
IV. GRNN MODEL
In this study, a generalized regression neural network (GRNN) is used for evaporation estimation and to validate the Gamma Test. GRNN is a special type of neural network and a universal approximator that can approximate a continuous function to arbitrary accuracy, given a sufficient number of neurons [15]. An important advantage of GRNN is that its training is very fast, a very useful feature in this study since we need to try many input combinations to validate the Gamma Test (more advantages of the GRNN are listed below). As shown in Figure 2, a typical GRNN has two layers of artificial neurons: radial basis neurons in the first layer and a linear neuron in the second layer. The number of radial basis neurons is equal to the number of training samples. The input to each radial basis neuron is the distance between the input vector and the corresponding training sample, and the output of a radial basis neuron is the radial basis function of that input scaled by the spread factor. These radial basis neurons give an output characterizing the closeness between input vectors and training samples.

Fig. 2. A typical GRNN architecture

Given input-output pairs {(x_i, y_i)} ∈ R^n × R^1, i = 1, 2, ..., M, as the training samples, the GRNN output y(x) for a test point x is defined as:

y(x) = Σ_{i=1}^{M} w_i · y_i    (8)

where w_i = exp(−D_i²/(2σ²)) / Σ_{j=1}^{M} exp(−D_j²/(2σ²)), with D_i the distance between x and x_i and σ the spread factor [15].

Compared with conventional multilayer perceptron networks, GRNN has several advantages: 1) it can accurately approximate functions from sparse and noisy data; 2) it converges to the conditional mean surface as the number of data samples increases; 3) it has only one design parameter (the spread factor); and 4) it is easy to train. It is these advantages that led us to choose GRNN as our evaporation prediction model.

GRNN has one tunable parameter, the spread factor (σ in Equation 8 above). When the spread factor is small, the radial basis function is steep and the network tends to respond only to the training samples closest to the input vector, which results in a rough response surface. As the spread factor increases, the radial basis function becomes wider and several neurons may respond to an input vector; the network then acts as if taking a weighted average of those neuron outputs, which leads to a smoother response surface. The spread factor thus dictates GRNN prediction performance. Unfortunately, there is no single spread factor that works well for all problems, and no analytical way to accurately determine it for a given problem, so empirical determination of the spread factor is common practice in GRNN design. For each experiment in this study, a fixed set of spread factors from 0.14 to 0.32 with an increment of 0.03 is used, and the best GRNN performance (e.g., MSE) over all of these spread factors is reported.
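Equation (8) and the empirical spread-factor sweep described above can be illustrated as follows. This is a sketch under our own assumptions (Gaussian kernel weights as in Specht [15]; the helper names `grnn_predict` and `best_spread` are hypothetical), not the exact code used in the study.

```python
import numpy as np

def grnn_predict(Xtr, ytr, Xte, spread):
    """GRNN output y(x) = sum_i w_i * y_i (Eq. 8) with Gaussian
    radial basis weights controlled by the spread factor."""
    # Squared distances between each test point and all training samples
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * spread ** 2))     # radial basis layer
    w /= w.sum(axis=1, keepdims=True)         # normalised weights w_i
    return w @ ytr                            # linear output layer

def best_spread(Xtr, ytr, Xte, yte, spreads=np.arange(0.14, 0.33, 0.03)):
    """Empirical spread selection: report the best test MSE over a
    fixed grid of spread factors, as done in this study."""
    mses = {s: ((grnn_predict(Xtr, ytr, Xte, s) - yte) ** 2).mean()
            for s in spreads}
    s = min(mses, key=mses.get)
    return s, mses[s]
```

Note that because every training sample becomes a radial basis neuron, no iterative training is needed; "training" is simply storing the samples, which is why trying all 15 input combinations is cheap.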

V. RESULTS

In this study, different combinations of the input data were explored to assess their influence on the evaporation modeling. There were 2^m − 1 meaningful combinations of inputs (i.e., 15 in this study), from which the best one can be determined by observing the Gamma values, which indicate the best MSE attainable using any modeling method on unseen input data. Thus, we performed the Gamma Test on the different combinations, varying the number of inputs, as shown in Table 1. In the table, the minimum value of Γ was observed when all the input variables W, T, RH and Ed were used. In theory, the gradient is considered an indicator of model complexity, and the V-ratio is a measure of the predictability of the given outputs using the available inputs. An input data set with a low Gamma value, in addition to a low gradient and V-ratio, is considered the best scenario for modeling. Since there is a lack of quantitative guidance on using the gradient and V-ratio, only the Gamma values are used in the analysis. It can be seen that the best result for a well calibrated model should be around 7.28, since this would be the innate noise level embedded within the data.

TABLE 1. Gamma Test results for 15 input variable combinations

No   Mask   Gamma   Gradient   V-ratio
1    0001   21.1     2723      0.44
2    0010   52.31   -53791     1.11
3    0011   17.73   -3         0.46
4    0100   18.16    637       0.61
5    0101   17.88    240       0.47
6    0110   17.88   -286       0.36
7    0111   17.7    -39        0.37
8    1000   39.6    -5502      0.48
9    1001    9.4     597       0.22
10   1010   25.89    1511      0.45
11   1011    7.3     509       0.20
12   1100    7.44    460       0.12
13   1101    7.37    127       0.15
14   1110    7.46    109       0.10
15   1111    7.28    142       0.14

Note: the mask represents the variables W, T, RH and Ed (wind, temperature, relative humidity and saturation vapor pressure deficit).

To validate the Gamma Test results, GRNN models have been trained and tested with the same input combinations as in the Gamma Test. The GRNN results are shown in Table 2.

TABLE 2. GRNN results for 15 input variable combinations
                 Training          Testing
No   Mask    R^2     MSE       R^2     MSE
1    0001    0.75    20.98     0.759   20.6
2    0010    0.391   51.19     0.412   50.34
3    0011    0.786   18.06     0.785   18.35
4    0100    0.775   18.89     0.779   18.91
5    0101    0.789   17.75     0.789   18.03
6    0110    0.788   17.89     0.788   18.19
7    0111    0.79    17.64     0.79    17.95
8    1000    0.544   38.37     0.542   39.26
9    1001    0.889    9.38     0.885    9.83
10   1010    0.699   25.38     0.687   26.73
11   1011    0.915    7.25     0.896    8.9
12   1100    0.91     7.6      0.904    8.27
13   1101    0.917    7.02     0.907    7.99
14   1110    0.921    6.7      0.905    8.16
15   1111    0.926    6.24     0.907    8.01

It is interesting that the best training result matched quite well with the Gamma Test, but the best testing result is different: the input variable combination 1101 (i.e., with relative humidity removed) is the best in the model testing stage.

TABLE 3. Gamma Tests with SE (Standard Error)

No   Mask   Gamma   SE
1    0001   21.1    0.25
2    0010   52.31   0.62
3    0011   17.73   0.21
4    0100   18.16   0.21
5    0101   17.88   0.21
6    0110   17.88   0.21
7    0111   17.7    0.21
8    1000   39.6    0.47
9    1001    9.4    0.11
10   1010   25.89   0.31
11   1011    7.3    0.09
12   1100    7.44   0.09
13   1101    7.37   0.09
14   1110    7.46   0.09
15   1111    7.28   0.09

The standard error is a valuable source for assessing the uncertainties of the estimated Gamma values. Table 4 presents the 68.2% confidence intervals for the lowest Gamma value group, under the assumption that the Gamma value errors follow a Gaussian distribution as shown in Figure 3.

Fig. 3. The Gaussian Uncertainty Distribution [17]

VI. DISCUSSION

From the results, it is interesting to note that the Gamma Test failed to pick out the best input variable combination for the GRNN model. However, a further look into the results reveals that there is a cluster of low Gamma values (from No 11 to No 15) which matches very well with the best GRNN testing results. This indicates that although the Gamma Test did not provide the very best model input variables, it did narrow the inputs down to a best group that included the very best input combination.

The uncertainty with the Gamma Test prompts us to explore another important property of the Gamma value: its standard error (SE). The SE indicates the variance of the Gamma estimate. In the past, the SE has been largely ignored by model developers; we have listed the SE values in Table 3 for this case study. It can be seen that the lowest Gamma value group has a standard error of 0.09.

TABLE 4. Gamma Tests with SE (Standard Error)

Mask   Gamma   SE     Gamma Range
1011   7.30    0.09   7.21 - 7.39
1100   7.44    0.09   7.35 - 7.53
1101   7.37    0.09   7.28 - 7.46
1110   7.46    0.09   7.37 - 7.55
1111   7.28    0.09   7.19 - 7.37

The Gamma value ranges with 68.2% confidence intervals clearly show overlaps among the lowest Gamma values. The best input combination indicated by the lowest Gamma value of 7.28 is Mask 1111, whereas the best input combination from cross validation is Mask 1101, with a Gamma value of 7.37. If the 68.2% uncertainty bands are considered, the apparent advantage of Mask 1111 over Mask 1101 becomes fuzzy and we can no longer be confident about Mask 1111. From Figure 3, it can be seen that if 2 or 3 standard deviations are considered, the overlaps among the lowest Gamma values would be even larger, and the selection of the best input variables based on the lowest Gamma value would be even harder.
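The overlap argument can be checked mechanically with simple interval arithmetic. The sketch below uses the values from Table 4; the helper names are our own.

```python
# 68.2% confidence intervals (Gamma +/- 1 SE) for the lowest-Gamma group,
# values taken from Table 4: mask -> (Gamma, SE)
table4 = {
    "1011": (7.30, 0.09),
    "1100": (7.44, 0.09),
    "1101": (7.37, 0.09),
    "1110": (7.46, 0.09),
    "1111": (7.28, 0.09),
}

def interval(gamma, se, n_sd=1):
    """Confidence interval Gamma +/- n_sd standard errors."""
    return gamma - n_sd * se, gamma + n_sd * se

def overlaps(a, b, n_sd=1):
    """True if the two (Gamma, SE) entries have overlapping intervals."""
    (lo1, hi1), (lo2, hi2) = interval(*a, n_sd), interval(*b, n_sd)
    return lo1 <= hi2 and lo2 <= hi1

# Mask 1111 (lowest Gamma) vs. Mask 1101 (best under cross validation):
# their 1-SE bands already overlap, so 1111 is not clearly better.
print(overlaps(table4["1111"], table4["1101"]))  # True
```

Widening the bands to 2 or 3 standard deviations makes every pair in the group overlap, which is exactly why the lowest Gamma value alone cannot single out the best combination.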

VII. CONCLUSIONS

The Gamma Test is potentially a very powerful tool for mathematical model developers to build nonlinear models more efficiently. In theory, if there are sufficient high quality data, the Gamma values should provide reliable information about the best input variable combination. However, real data suffer from both quantity and quality problems. The standard error in the Gamma Test has been largely ignored by modellers in the past, and this study demonstrated that it is very valuable for quantifying the uncertainty of the estimated Gamma value: the higher the standard error, the more uncertain the Gamma value. Despite this uncertainty problem, the Gamma Test is still very useful because it can narrow down the input variable selection process by providing a group of potential input combinations indicated by the lowest Gamma value group. A modeller should then further explore the input combinations in this group instead of all the possible combinations. The selection of the group members should be guided by the standard error: the larger the standard error, the larger the group, and hence the more time and effort needed to find the best input combination within it. Clearly, more research is still needed on how to use the standard error and the Gamma values to choose the most effective group size, one that minimises the risk of selecting a suboptimal input combination while at the same time keeping the group small to reduce model development time and effort. It is also important to try different data sets with several irrelevant variables to check whether the Gamma Test can effectively remove them.

REFERENCES

[1] D. Han, L. Chan and N. Zhu, "Flood forecasting using Support Vector Machines", Journal of Hydroinformatics, Vol. 9, No. 4, 267-276, DOI: 10.2166/hydro.2007.027, 2007
[2] D. Han, T. Kwong, and S. Li, "Uncertainties in real-time flood forecasting with neural networks", Hydrological Processes, 21, 223-228, DOI: 10.1002/hyp.6184, 2007
[3] M. Bray and D. Han, "Identification of Support Vector Machines for Runoff Modelling", Journal of Hydroinformatics, 6, Issue 4, 265-280, 2004
[4] A. Stefánsson, N. Končar and A.J. Jones, "A note on the Gamma Test", Neural Computing and Applications, 5(3): 131-133, 1997
[5] A. Moghaddamnia, M. Ghafari Gousheh, J. Piri, S. Amin, and D. Han, "Evaporation Estimation Using Artificial Neural Networks and Adaptive Neuro-Fuzzy Inference System Techniques", Advances in Water Resources, Vol. 32, No. 1, pp. 88-97, doi: 10.1016/j.advwatres.2008.10.005, 2009
[6] R. Remesan, M.A. Shamim and D. Han, "Model Data Selection using Gamma Test for Daily Solar Radiation Estimation", Hydrological Processes, Vol. 22, No. 21, 4301-4309, DOI: 10.1002/hyp.7044, 2008
[7] M. Kottek, J. Grieser, C. Beck, B. Rudolf, and F. Rubel, "World Map of the Köppen-Geiger climate classification updated", Meteorol. Z., 15: 259-263, doi: 10.1127/0941-2948/2006/0130, 2006
[8] N.A. Chuzhanova, A.J. Jones and S. Margetts, "Feature selection for genetic sequence classification", Bioinformatics, 14(2): 139-143, 1998
[9] A.G. De Oliveira, "Synchronisation of chaos and applications to secure communications", PhD thesis, Department of Computing, Imperial College of Science, Technology and Medicine, University of London, 1999
[10] A.P.M. Tsui, "Smooth Data Modelling and Stimulus-Response via Stabilisation of Neural Chaos", PhD thesis, Department of Computing, Imperial College of Science, Technology and Medicine, University of London, 1999
[11] A.P.M. Tsui, A.J. Jones and A.G. De Oliveira, "The construction of smooth models using irregular embeddings determined by a gamma test analysis", Neural Computing and Applications, 10(4): 318-329, doi: 10.1007/s005210200004, 2002
[12] P.J. Durrant, "winGamma: A non-linear data analysis and modelling tool with applications to flood prediction", PhD thesis, Department of Computer Science, Cardiff University, Wales, UK, 2001
[13] A.J. Jones, A. Tsui and A.G. De Oliveira, "Neural models of arbitrary chaotic systems: construction and the role of time delayed feedback in control and synchronization", Complexity International, Vol. 09, 2002
[14] D. Evans and A.J. Jones, "A proof of the gamma test", Proceedings of the Royal Society, Series A, 458(2027): 2759-2799, 2002
[15] D. Specht, "A General Regression Neural Network", IEEE Trans. on Neural Networks, 2(6): 568-576, 1991
[16] D. Han and W. Yan, "Validation of the Gamma Test for Model Input Data Selection - with a case study in evaporation estimation", 5th International Conference on Natural Computation (ICNC'09), Tianjin, China, August 14-16, 2009, http://www.icnc09-fskd09.ut.edu.cnl, ISBN 978-0-7695-3736-8, BMS Number CFP09CNC-PRT
[17] "Normal Distribution", Wikipedia, http://en.wikipedia.org/wiki/Normal_distribution
