www.elsevier.com/locate/rse
Received 24 January 2005; received in revised form 23 August 2005; accepted 29 August 2005
Abstract
The number of training samples per class (n) required for accurate Maximum Likelihood (ML) classification is known to be affected by the
number of bands (p) in the input image. However, the general rule that n should be 10p to 30p is often enforced universally in
remote sensing without questioning its relevance to the complexity of the specific discrimination problem. Furthermore, identifying this many
training samples is often problematic when many classes and/or many bands are used. It is important, then, to test how this generally accepted rule
matches common remote sensing discrimination problems because it could be unnecessarily restrictive for many applications. This study was
primarily conducted in order to test whether the general rule defining the relationship between n and p was well-suited for ML classification of a
relatively simple remote sensing-based discrimination problem. To summarise the mean response of n-to-p for our study site, a Monte Carlo
procedure was used to randomly stack various numbers of bands into thousands of separate image combinations that were then classified using an
ML algorithm. The bands were randomly selected from a 119-band Enhanced Thematic Mapper-plus (ETM+) dataset comprised of 17 images
acquired during the 2001-2002 southern hemisphere summer agricultural growing season over an irrigation area in south-eastern Australia.
Results showed that the number of training samples needed for accurate ML classification was much lower than the current widely accepted rule.
Due to the asymptotic nature of the relationship, we found that 95% of the accuracy attained using n = 30p samples could be achieved by using
approximately 2p to 4p samples, or about 1/7th the currently recommended value of n. Our findings show that the number of training samples needed
for a simple discrimination problem is much less than that defined by the general rule and therefore the rule should not be universally enforced; the
number of training samples needed should also be determined by considering the complexity of the discrimination problem.
© 2005 Elsevier Inc. All rights reserved.
Keywords: Crop classification; Dimensionality; Training sample; Time-series; Multi-temporal; Maximum likelihood
1. Introduction
The 'curse of dimensionality' is the tendency for model accuracy to initially increase as the number of variables (e.g., bands, p) used increases, but then reach a limit beyond which accuracy decreases: the point where the model is overfit (Hand, 1981; Hughes, 1968; Pal & Mather, 2003). This phenomenon is called 'peaking' in the pattern recognition literature (Jain & Waller, 1978) and in the remote sensing literature has been referred to as the Hughes phenomenon (Hughes, 1968).
Fig. 1. Location of the CIA in New South Wales, Australia. The overlapping rectangles represent two ETM+ scenes (P92/R84 and P93/R84), where P is Path and R is Row of the Landsat World Reference System-2 (WRS-2). The dashed lines through the study site represent the Hyperion swath.
Table 1
The ETM+ dataset

Date            DS1O   Image number   Channel numbers
08 Oct. 2001    007    1              1-7
17 Oct. 2001    016    2              8-14
02 Nov. 2001    032    3              15-21
09 Nov. 2001    039    4              22-28
25 Nov. 2001    055    5              29-35
04 Dec. 2001    064    6              36-42
05 Jan. 2002    096    7              43-49
12 Jan. 2002    103    8              50-56
13 Feb. 2002    135    9              57-63
22 Feb. 2002    144    10             64-70
10 Mar. 2002    160    11             71-77
17 Mar. 2002    167    12             78-84
02 Apr. 2002    183    13             85-91
11 Apr. 2002    192    14             92-98
18 Apr. 2002    199    15             99-105
27 Apr. 2002    208    16             106-112
04 May 2002     215    17             113-119

The southern hemisphere summer growing season at the CIA lasts from around October to May. Days since 1 October (DS1O) represent the number of days since the nominal start of the summer growing season.
The Bhattacharyya distance (B) and the Jeffries-Matusita (JM) distance between classes i and j were calculated as

$$B = \frac{1}{8}\left(m_i - m_j\right)^{T}\left(\frac{\Sigma_i + \Sigma_j}{2}\right)^{-1}\left(m_i - m_j\right) + \frac{1}{2}\ln\left(\frac{\left|\left(\Sigma_i + \Sigma_j\right)/2\right|}{\sqrt{\left|\Sigma_i\right|\left|\Sigma_j\right|}}\right) \tag{2}$$

$$JM = 2\left(1 - e^{-B}\right) \tag{3}$$

where $m_i$ and $m_j$ are the class mean vectors and $\Sigma_i$ and $\Sigma_j$ are the class covariance matrices.
The JM distance was calculated for the six possible
combinations of the four classes. These were calculated using
the entire validation dataset (described in Section 2.2, above)
for each class in order to summarise the entire class response.
This was performed for all 17 dates in turn.
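The JM calculation is straightforward to reproduce. The following is a minimal Python/NumPy sketch, assuming each class is supplied as an array of validation-pixel reflectances for the bands of a single date; the array names are illustrative only, not the authors' code.

```python
import numpy as np

def jeffries_matusita(x_i, x_j):
    """Jeffries-Matusita distance between two classes.

    x_i, x_j : arrays of shape (n_samples, n_bands) holding the
    reflectances of all validation pixels in each class.
    """
    m_i, m_j = x_i.mean(axis=0), x_j.mean(axis=0)
    c_i = np.cov(x_i, rowvar=False)
    c_j = np.cov(x_j, rowvar=False)
    c_mean = (c_i + c_j) / 2.0

    diff = m_i - m_j
    # Bhattacharyya distance (Eq. 2)
    term1 = 0.125 * diff @ np.linalg.inv(c_mean) @ diff
    term2 = 0.5 * np.log(np.linalg.det(c_mean) /
                         np.sqrt(np.linalg.det(c_i) * np.linalg.det(c_j)))
    b = term1 + term2
    # JM distance (Eq. 3): 0 = identical distributions, 2 = fully separable
    return 2.0 * (1.0 - np.exp(-b))

# e.g., one value per date for each of the six class pairs:
# jm = jeffries_matusita(rice_pixels, maize_pixels)
```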
2.5. Determining the relationship between n and p
Three methods were used to summarise the relationship
between n and p. Each of these measured how well multiple
training data pdfs matched associated validation data pdfs.
These methods included: (1) assessing the mean and standard deviation (SD) of the ratio of the training to validation reflectances; (2) calculating the probability (P) that the validation data would be assigned as a member of the correct class based on the relevant training data; and (3) determining the Kappa classification accuracy (K) for each crop (the training data were used to classify the image and the validation data were used to assess accuracy). The K method was calculated in two ways: (i) the response of K to varying p (K_p), for set values of n; or (ii) the response of K to varying n (K_n), for set values of p. A Monte Carlo procedure was used to randomly select and 'stack' bands from the available ETM+ dataset; the training and validation pixels of these random band combinations formed the basis of these three comparisons. The general Monte Carlo procedure will be described first, followed by the three summaries of the n-to-p relationship.
2.5.1. Monte Carlo procedure
A Monte Carlo procedure involves two aspects: (1)
randomisation; and (2) integration or averaging. A Monte
Carlo procedure was used to summarise the general response
between n and p as described above. The Monte Carlo
procedure consisted of 8800 iterations in total (to clarify, here 'iteration' simply refers to the repetition of a process, not to providing a closer approximation of a solution to an equation). Within an iteration, a predetermined number of bands was randomly combined into a single image stack (see Fig. 2).
[Fig. 2 flow chart: the recoverable steps include stacking randomly selected bands, intersecting the training and validation points with the stacked image, and, after 100 iterations, calculating the mean and SD of reflectance, probability, and accuracy, as well as the training-to-validation ratios for reflectance.]

Fig. 2. An example of the Monte Carlo procedure is shown for a unique n-p combination. In this example, a 5-band image stack (made up of bands 11, 33, 40, 61, and 107) is classified into rice (R), maize (M), sorghum (Sr) and soybeans (Sy). Note, cells in steps 2, 3, and 5 denote fields, not individual pixels.
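The workflow in the figure is easy to emulate. The sketch below shows one possible form of a single Monte Carlo iteration for a given n-p combination in Python/NumPy, using a standard Gaussian maximum-likelihood discriminant with equal priors; full_train, valid_pixels and valid_labels are illustrative names for per-class pixel arrays drawn from the 119-band dataset, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ml_classify(train_by_class, pixels):
    """Gaussian maximum-likelihood classification with equal priors.

    train_by_class : dict of class name -> (n, p) training reflectance array
    pixels         : (m, p) array of pixels to be labelled
    """
    names, scores = list(train_by_class), []
    for name in names:
        x = train_by_class[name]
        mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False)          # becomes singular when n <= p
        diff = pixels - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        scores.append(-0.5 * (logdet + maha))  # log-likelihood up to a constant
    return np.array(names)[np.argmax(scores, axis=0)]

# One iteration for a given (n, p) combination, e.g. n = 20 samples, p = 5 bands.
n, p = 20, 5
bands = rng.choice(119, size=p, replace=False)          # random band stack
train = {}
for c in ('rice', 'maize', 'sorghum', 'soybeans'):
    rows = rng.choice(len(full_train[c]), size=n, replace=False)
    train[c] = full_train[c][rows][:, bands]             # n training pixels, p bands
predicted = ml_classify(train, valid_pixels[:, bands])   # classify validation pixels
accuracy = np.mean(predicted == valid_labels)            # or a per-class Kappa
```

Repeating such a draw 100 times for each n-p combination, then averaging the resulting reflectance ratios, probabilities, and accuracies, reproduces the per-combination summaries described above.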
[Fig. 3: JM distance (y-axis, 0 to 2) for the six class pairs Maize/Sorghum, Maize/Soybeans, Sorghum/Soybeans, Rice/Maize, Rice/Sorghum, and Rice/Soybeans, plotted against acquisition date (Oct. to May; 0 to 225 days since 1 October).]
Fig. 4. Ratio of training reflectance means to validation reflectance means (solid line) and training reflectance standard deviations to validation reflectance standard deviations (dashed line) for rice (a), maize (b), sorghum (c), and soybeans (d) are shown for the initial 6000 Monte Carlo iterations. Note, ratios for n = 20 are offset by +0.20, ratios for n = 30 are offset by +0.40, and ratios for n = 40 are offset by +0.60 (i.e., if offsets were not applied, the lines would overlay near 1.0 on the y-axis).
Fig. 5. P of classifying the entire class mean vectors based on the training data for rice (a), maize (b), sorghum (c), and soybeans (d) is shown for the initial 6000 Monte Carlo iterations. Note, the curves representing P for the various values of n overlap prior to the relationship breaking down.
relationships do not break down, then classifiers that use first-order statistics (e.g., minimum distance, or spectral angle
mapper) would not suffer from the peaking phenomenon and
would tend to cumulatively increase in classification accuracy
with the addition of more bands.
3.2.2. Probability (P) assessment
The results of the P analysis are shown in Fig. 5. The point
where P decreased exponentially defined where the training
data pdf no longer represented the rest of the relevant class.
This happened in every case just prior to p = n (see Fig. 5a-d).
This was where the classification was overfit and after which poor accuracies would be expected. In this case, P was largely a function of the Mahalanobis distance, which has an inverse relationship to P; the Mahalanobis distance continues to increase
as p increases (for a constant n). The critical breakpoints for
each of the curves in Fig. 5 were associated with an opposing
increase in the Mahalanobis distance, which in turn caused
P → 0 (or approximately <10^-300 for double-precision floating-point values; see Fig. 5).
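This collapse is easy to reproduce outside the study data. The sketch below uses simulated p-dimensional Gaussian samples (an illustrative assumption, not the CIA imagery) to show how, for a fixed n, the Mahalanobis distance based on the estimated training covariance grows with p, and the corresponding multivariate-normal density evaluated at the true class mean underflows towards zero as p approaches n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20                                   # fixed number of training samples per class

for p in (5, 10, 15, 18, 19):            # increasing number of bands
    train = rng.normal(size=(n, p))      # simulated class samples (standard normal)
    class_mean = np.zeros(p)             # true mean of the simulated class

    mean = train.mean(axis=0)
    cov = np.cov(train, rowvar=False)    # nearly singular as p approaches n
    diff = class_mean - mean
    maha2 = diff @ np.linalg.inv(cov) @ diff
    _, logdet = np.linalg.slogdet(cov)
    # log multivariate-normal density of the class mean under the training estimate
    log_p = -0.5 * (p * np.log(2 * np.pi) + logdet + maha2)
    print(f"p={p:2d}  Mahalanobis^2={maha2:10.1f}  P~{np.exp(log_p):.3g}")
```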
The P analysis provided direct evidence that the classification model became overfit, and from this we were able to determine approximately where this occurred (just before p = n, near the point where the covariance matrix becomes singular). To represent this relationship in a way more meaningful to users, two classification accuracy assessments were run: (1) analysing the response of K_p (varying p, given certain values of n); and (2) analysing the response of K_n (varying n, given certain values of p).
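The per-crop Kappa values used in both assessments can be derived from a confusion matrix in the usual way; the exact per-crop form used by the authors is not given here, so the conditional Kappa below is only one plausible reading. A minimal sketch, assuming a square confusion matrix with reference classes on the rows and mapped classes on the columns:

```python
import numpy as np

def overall_kappa(conf):
    """Overall Kappa from a confusion matrix (rows = reference, cols = mapped)."""
    total = conf.sum()
    p_observed = np.trace(conf) / total
    p_chance = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    return (p_observed - p_chance) / (1.0 - p_chance)

def conditional_kappa(conf, i):
    """Per-class (conditional) Kappa for class i, e.g. one of the four crops."""
    total = conf.sum()
    row, col = conf.sum(axis=1)[i], conf.sum(axis=0)[i]
    return (total * conf[i, i] - row * col) / (total * row - row * col)
```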
Fig. 6. Per-field K_p for ML classifications are shown for n = 10, 20, 30, and 40 for rice (a), maize (b), sorghum (c), and soybeans (d). The points p_opt (empty circle), p_90% (filled circle) and p_75% (empty square) are shown on each curve. See the text for detailed definitions of these three terms.
Table 2
Critical break points defined from the accuracy statistics for the 4 crop types and training sample curves shown in Fig. 6, including p_opt, p_90% and p_75%; see text for detailed definitions

Class      n = 10                 n = 20                 n = 30                 n = 40
           p_opt  p_90%  p_75%    p_opt  p_90%  p_75%    p_opt  p_90%  p_75%    p_opt  p_90%  p_75%
Rice       5      7      8        10     16     18       20     26     28       25     36     38
Maize      5      7      8        10     16     18       15     26     27       25     35     38
Sorghum    6      7      8        10     15     17       15     25     27       20     35     37
Soybeans   5      7      8        10     16     18       15     26     28       25     36     38

Because of the rule defined for non-rapid areas of change at the end of Section 2.5.1, the p_opt statistic for n = 20, n = 30, and n = 40 is sampled to the nearest 5th increment. The p_75% and p_90% values in each case had 'single-channel' precision since the gap between the last two 5-channel iterations was always backfilled (again, see the end of Section 2.5.1 for full details).
Fig. 7. Per-field K_n for ML classifications are shown for p = 5, 10, 15, and 20 for rice (a), maize (b), sorghum (c), and soybeans (d). The seven symbols on each curve represent, in order, n = 1p, n = 2p, n = 3p, n = 4p, n = 5p, n = 10p, and n = 30p.
Table 3
Summary of the relationship between the accuracy of various n for set values of p

Class      p = 5                  p = 10                 p = 15                 p = 20
           n_90%  n_95%  n_99%    n_90%  n_95%  n_99%    n_90%  n_95%  n_99%    n_90%  n_95%  n_99%
Rice       3p     4p     30p      <2p    3p     10p      <2p    2p     5p       <2p    2p     4p
Maize      4p     5p     30p      3p     4p     10p      <2p    3p     10p      <2p    2p     5p
Sorghum    2p     3p     4p       <2p    2p     3p       <2p    2p     3p       <2p    2p     3p
Soybeans   3p     4p     10p      <2p    3p     10p      <2p    2p     5p       <2p    2p     5p

The values of n reported in the table relate to the accuracy determined using n = 30p; see the text for full details. Numbers of n are represented relative to p in the body of the table.
As p increased from 5 to 20, the ratio of n-to-p needed to attain the same relative proportion of accuracy decreased (Table 3). Also, the three different metrics (i.e., n_90%, n_95%, or n_99%) resulted in different ranges of recommended numbers of training samples: n_99% ranged from 4p to 30p, n_95% from 2p to 4p, and n_90% from 2p to 3p (Table 3). This meant that developing a single definition of the n-to-p relationship was difficult because it varied with both the value considered to represent satisfactorily high accuracy and the number of bands used. For example, the current rule of n needing to be 10p to 30p was about right if n_99% was considered for p = 5 or p = 10. However, if either n_90% or n_95% was considered to represent satisfactorily high accuracy for the same p, then a smaller n would be required (i.e., n = 2p to 4p). Likewise, if p = 15 to 20, then for n_90%, n_95%, or n_99%, the n required would also be far less than 10p to 30p (i.e., n = 2p to 5p).
The 'optimum' point along each curve was defined here as the point of sufficiently high accuracy after which little gain in accuracy was attained per extra training sample added. We selected n_95% as it was positioned where the curves were: (1) not varying dramatically, and (2) not yet showing large diminishing returns in accuracy (Fig. 7 and Table 3). Based on the n_95% metric, the ideal number of samples ranged from about 2p to 4p. This showed that 95% of the accuracy was retained at our study site by using about 1/7th of the number of training samples recommended by the previously accepted rule (based on n = 30p). There was no metric for which n = 10p to 30p was consistently needed across the range of p values tested.
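For example, at p = 15 bands the conventional n = 30p rule implies 450 training samples per class, whereas the n_95% range of roughly 2p to 4p corresponds to only 30 to 60 samples per class, at most around 1/7th of the recommended number.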
3.2.5. Similarity analysis instead of significance testing
Performing a statistically rigorous analysis of the main results of the two classification analyses above (K_p and K_n) could greatly strengthen the interpretation of the results. For example, in the K_p analysis, it would be interesting to know how similar p_opt was to p_75%. This would show the impact on classification accuracy of using more bands than optimal (for a given n). Likewise, for the K_n analysis, as we previously recommended using n_95% to define how many training samples are required, the most interesting comparison would be whether this 95% accuracy level (n_95%) was very similar (statistically) to the accuracy defined by the recommended rule (n = 30p). This would show the impact of using fewer training samples than recommended for a given p. If they were very similar, it would strengthen the argument that, for our dataset, attaining n = 30p samples was not necessary.
The similarity analysis was tested on four null hypotheses; the first two gauged the impact of using more bands than was optimal given a set value of n, while the remaining two assessed the impact of using fewer training samples than the recommended n = 30p. Specifically, they are:

(1) H0: p_opt ~ p_90%, or how similar each of the 100 accuracies from the n-to-p combination representing p_90% was to the mean (minus 1, 2, or 3 SDs) of the 100 accuracies from the n-to-p combination representing p_opt;

(2) H0: p_opt ~ p_75%, or how similar each of the 100 accuracies from the n-to-p combination representing p_75% was to the mean (minus 1, 2, or 3 SDs) of the 100 accuracies from the n-to-p combination representing p_opt;

(3) H0: n_30p ~ n_95%, or how similar each of the 100 accuracies from the n-to-p combination representing n_95% was to the mean (minus 1, 2, or 3 SDs) of the 100 accuracies when n = 30p; and

(4) H0: n_30p ~ n_90%, or how similar each of the 100 accuracies from the n-to-p combination representing n_90% was to the mean (minus 1, 2, or 3 SDs) of the 100 accuracies when n = 30p.

The similarity was summarised as the probability that a given condition would arise if H0 were true, averaged over the 100 Monte Carlo iterations:

$$S = \frac{1}{100}\sum_{i=1}^{100}\mathrm{prob}\left(\mathrm{cond}_x \mid H_0\right)_i \tag{7a, 7b}$$

$$\mathrm{cond}_x:\quad\begin{cases} x = 0: & \text{expected} = \mu\!\left(K_{\mathrm{ref}}\right)\\ x = 1: & \text{expected} = \mu\!\left(K_{\mathrm{ref}}\right) - 1\,\mathrm{SD}\!\left(K_{\mathrm{ref}}\right)\\ x = 2: & \text{expected} = \mu\!\left(K_{\mathrm{ref}}\right) - 2\,\mathrm{SD}\!\left(K_{\mathrm{ref}}\right)\\ x = 3: & \text{expected} = \mu\!\left(K_{\mathrm{ref}}\right) - 3\,\mathrm{SD}\!\left(K_{\mathrm{ref}}\right) \end{cases} \tag{7c}$$

where K_ref denotes the 100 Kappa accuracies of the reference n-to-p combination in each hypothesis.
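The similarity probabilities can be approximated empirically from the Monte Carlo output. The sketch below is one possible reading of Eq. (7), assuming k_ref and k_test are arrays of the 100 Kappa accuracies from the reference and candidate n-p combinations; the names are illustrative, not from the paper.

```python
import numpy as np

def similarity(k_test, k_ref, x):
    """Fraction of candidate accuracies meeting the condition
    'expected >= mean(K_ref) - x * SD(K_ref)' for x = 0, 1, 2, or 3."""
    threshold = k_ref.mean() - x * k_ref.std(ddof=1)
    return float(np.mean(k_test >= threshold))

# e.g., H0: n_30p ~ n_95%  ->  compare k_test (n = n_95%) against k_ref (n = 30p)
# probabilities = [similarity(k_test, k_ref, x) for x in (0, 1, 2, 3)]
```

Under this reading, higher values of the resulting probability indicate greater similarity between the two combinations, mirroring the behaviour shown in Fig. 8.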
Since we showed in Section 3.2.4 that the number of training samples needed to attain 95% of the accuracy was approximately 15% of that needed when using n = 30p, the third null hypothesis analysis was particularly useful as it helped define whether acquiring the remaining 85% of the training samples made a difference to the results. The results of these four analyses are shown in Fig. 8; the responses for the four classes (i.e., rice, maize, sorghum and soybeans) were averaged to achieve a single response for each H0 tested.
Fig. 8 reveals an indirect relationship between similarity and
both the size of n (Fig. 8a and b) and the size of p (Fig. 8c and
d). This was expected as the mean accuracy increased and the
Fig. 8. Probabilities that cond_x would arise given that H0 is true are shown for H0: p_opt ~ p_90% (a), H0: p_opt ~ p_75% (b), H0: n_30p ~ n_95% (c), and H0: n_30p ~ n_90% (d). Higher probabilities are indicative of higher similarity between the reference and candidate accuracy distributions.