
In Proceedings of the 25th Annual Technical Conference and Exhibition of the Remote Sensing Society, Cardiff, UK, pp. 675-682, 8-10 September 1999.

Determining Optimum Structure for Artificial Neural Networks


Taskin Kavzoglu
School of Geography, The University of Nottingham, NG7 2RD, Nottingham. Telephone: +44 (115) 951 5450 Fax: +44 (115) 951 5249 Email: kavzoglu@geography.nottingham.ac.uk

Abstract
Artificial Neural Networks (ANNs) have attracted increasing attention from researchers in many fields, including economics, medicine and computer processing, and have been used to solve a wide range of problems. In remote sensing research, ANN classifiers have been used for many investigations, such as land cover mapping, image compression, geological mapping and meteorological image classification, and have generally proved to be more powerful than conventional statistical techniques, especially when the training data are not normally distributed. The use of ANNs requires some critical decisions on the part of the user, which may affect the accuracy of the resulting classification. In this study, the determination of the optimum network structure, which is one of the most important attributes of a network, is investigated. The structure of the network has a direct effect on training time and classification accuracy. Although there is some discussion in the literature of the impact of network structure on the performance of the network, there is no established method or approach for determining the best structure. Investigations of the relationship between the network structure and the accuracy of the classification are reported here, using a MATLAB tool-kit to take advantage of scientific visualisation. The effect of the composition of the training data on network structure is also investigated.

1. Introduction
Artificial Neural Networks (ANNs) have been used in many fields for a variety of applications, and have proved to be reliable. In spite of their unique advantages, such as their non-parametric nature, arbitrary decision boundary capabilities, and easy adaptation to different types of data, they possess some inherent limitations. These limitations arise from a number of factors that may affect the accuracy of the classification, and which can be divided into two main groups: external factors and internal factors. External factors include the characteristics of the input data set (multisensor, multispectral etc.) and the scale of the study, whilst internal factors are the choices of an appropriate network structure, initial weights, number of iterations, transfer function and learning rate.

Understanding these factors and choosing appropriate parameter values are key issues for a successful ANN application. In this study, the effect of network structure on the performance of the classifier is investigated. As the size of the input layer is equal to the number of input features and that of the output layer is equal to the number of output classes, the adjustable part of the neural network is the number and composition of the hidden layer(s). Unfortunately, there is little discussion in the literature of methods to determine the optimum number of hidden layers and units (or nodes) for a particular problem, so researchers generally use a trial-and-error method for this purpose.
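As a concrete illustration of this trial-and-error approach, the sketch below trains a one-hidden-layer network for several candidate hidden-layer sizes and reports the training accuracy of each. This is a minimal Python/NumPy stand-in on a toy two-class problem (XOR), not the paper's MATLAB tool-kit or its remote sensing data.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    """Append a constant bias input of 1 to every pattern."""
    return np.hstack([A, np.ones((A.shape[0], 1))])

def train_mlp(X, y, n_hidden, epochs=3000, lr=0.5):
    """One-hidden-layer MLP with sigmoid units, trained by batch backpropagation."""
    Xb = add_bias(X)
    W1 = rng.normal(scale=0.5, size=(Xb.shape[1], n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, y.shape[1]))
    for _ in range(epochs):
        h = sig(Xb @ W1)
        hb = add_bias(h)
        o = sig(hb @ W2)
        d_o = (o - y) * o * (1 - o)            # output-layer delta
        d_h = (d_o @ W2[:-1].T) * h * (1 - h)  # hidden-layer delta (bias row dropped)
        W2 -= lr * hb.T @ d_o
        W1 -= lr * Xb.T @ d_h
    return W1, W2

def accuracy(X, y, W1, W2):
    o = sig(add_bias(sig(add_bias(X) @ W1)) @ W2)
    return float(np.mean((o > 0.5) == y))

# Toy two-class problem (XOR); scan candidate hidden-layer sizes.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
for n_hidden in (1, 2, 4, 8):
    W1, W2 = train_mlp(X, y, n_hidden)
    print(n_hidden, accuracy(X, y, W1, W2))
```

Each candidate structure must be trained from scratch, which is exactly why the trial-and-error search is so expensive for realistic networks.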

2. The Importance of Artificial Neural Network Structure


In the case of layered neural network architectures, network size is related not only to the number of layers, but also to the number of nodes in each layer and the number of connections between these nodes. For a given data set there may be an infinite number of network structures capable of learning the characteristics of the data. The question is: what size of network will be best for a specific data set? Unfortunately, this question is not easy to answer. The neural network architecture that gives the best result for a particular problem can only be determined experimentally. The quality of a solution found by a neural network is strongly dependent on the network size used. In general, the network size affects network complexity and learning time, but most importantly it affects the generalisation capabilities of the network. Generalisation is the ability of the neural network to interpolate and extrapolate to data that it has not seen before (Atkinson and Tatnall 1997). The power of an ANN depends on how well it generalises from the training data. The size and characteristics of the training data set, together with the number of iterations, are other factors affecting the generalisation capabilities of a neural network. Large networks take a long time to learn the characteristics of the data, whilst small networks may become trapped in a local error minimum and fail to learn from the training data. When the number of nodes in the hidden layer(s) is increased, the network can learn more complex data by locating decision boundaries more accurately in the feature space. However, this also increases the time required to train the network while reducing its generalisation capabilities.
On the other hand, since there is an almost linear relationship between the number of samples required for the training process and the number of hidden units, large networks generally require more training samples than small networks to achieve good generalisation performance. In many applications only a limited number of samples is available, so using large networks for such data sets may lead the network to produce unsatisfactory results. The number of input nodes is directly related to the size of the hidden layer, as each node represents a different characteristic of the pattern used. More input nodes mean more information with which to determine class boundaries (to distinguish classes from each other in feature space). The input layer can be expanded simply by adding new data sources as additional neurones, but this increases the computation time on the order of N², where N is the size of the input layer. In other words, if the size of the input data is doubled, the time required to train the network will be roughly four times the original. Therefore, new data sets should be added only if they contribute to a significantly improved classification. Another parameter is the number of output classes, which is decided at the initial stage. A larger number of output classes usually makes the problem more complex, since the network has to determine more complex class boundaries in feature space. Therefore, it is very important to choose the optimum number of output classes to avoid unnecessary training.

Some problems require more than one hidden layer to train a network properly, whereas others require only a single hidden layer. The use of unnecessary additional hidden layers can make the network too specific and increase training time. If too few hidden units are used, the network will fail to achieve satisfactory performance, whereas if too many hidden units are employed, the network will tend to 'memorise' the patterns in the training set (and become overspecific to the training data) and, hence, give poor performance for patterns that are not included in the training data. The best generalisation performance is obtained by trading the training error against network complexity (Le Cun et al. 1990). It should be noted that a smaller network is more likely to generalise well, since it extracts only the essential and significant characteristics of the training data. Lippmann (1987) shows that a Multilayer Perceptron (MLP) with one hidden layer can implement arbitrary convex decision boundaries. In addition, Cybenko (1989) has pointed out that a network with one hidden layer can form an arbitrarily close approximation to any continuous non-linear mapping, assuming only that the transfer function computed by a neurone is nonconstant, bounded, continuous and monotone increasing. However, these conclusions do not imply that there is no benefit in having more than one hidden layer. For some problems a small two-hidden-layer network can be used where a single-hidden-layer network would require a large number of nodes (Chester 1990). While the use of multiple hidden layers provides some potential benefits, it does not solve the problem of determining the appropriate number of hidden nodes; it simply extends the problem from one layer to multiple layers. Two approaches to the estimation of the proper size of a neural network are discussed in the literature.
One is to start with a small network and iteratively increase the number of nodes in the hidden layer(s) until satisfactory learning is achieved. Techniques based on this approach are called constructive techniques (Hirose et al. 1991). However, small networks are sensitive to initial conditions and learning parameters. They are also more likely to become trapped in a local minimum, as the error surface of a smaller network is more complicated and includes more local minima than the error surface of a larger network (Bebis and Georgiopoulos 1994). As a consequence of the algorithm, a number of networks must be trained to find the optimum network structure, which takes a long time. The second approach is to begin with a larger network and make it smaller by iteratively eliminating nodes in the hidden layer(s) or interconnections between nodes. Algorithms of this type are called pruning. Optimal brain damage, optimal brain surgeon, and skeletonization are the major pruning techniques in use. After a network is trained to a desired solution with the training data, units (hidden layer nodes or interconnections) are analysed to find those that are not participating in the solution, and these are eliminated (Kavzoglu and Mather 1998). More information on pruning techniques can be found in Reed (1993). The neural network learning process is very similar to a curve-fitting problem using a polynomial function. In determining the best curve for a given set of data points, the first decision to be made is the optimum order of the polynomial function to be used. The coefficients of the function are then estimated so as to minimise the sum of the squared differences between the required and predicted values. The polynomial can then be evaluated at any point, even at data points that were not in the initial data set.
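The pruning approach described above can be sketched with the simplest possible criterion: weight magnitude. This is a deliberate simplification of the second-derivative saliency measures used by optimal brain damage and optimal brain surgeon; the weight matrix and pruning fraction below are illustrative assumptions.

```python
import numpy as np

def prune_by_magnitude(W, fraction):
    """Zero out the given fraction of smallest-magnitude weights, i.e. the
    interconnections judged to participate least in the solution. (Real
    pruning methods use saliency measures; magnitude is the crudest proxy.)"""
    flat = np.abs(W).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return W.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = W.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0     # ties at the threshold are also cut
    return pruned

W = np.array([[0.9, -0.05, 0.3],
              [-0.01, 1.2, -0.4]])
P = prune_by_magnitude(W, 0.5)   # remove half of the 6 connections
print(P)                          # the weights 0.3, -0.05 and -0.01 are zeroed
```

After pruning, the reduced network is retrained and the cycle repeats until accuracy starts to degrade, mirroring the iterative elimination described above.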
If the order of the polynomial chosen is too low, the approximations obtained are poor, even for points contained in the initial data set. On the other hand, if the order of the polynomial chosen is too high, poor values may be computed for points not included in the initial data set. This illustrative example is given as Figure 1. Similarly, a network with a structure simpler than necessary cannot give good approximations even for patterns included in its training set, while a structure more complicated than necessary 'overfits' the training data; that is, it performs well on patterns included in the training set but very poorly on unknown patterns (Bebis and Georgiopoulos 1994). This study aims to gain some insight into the determination of the optimum structure for neural networks. For the delineation of seven land cover classes, six network structures (6-8-7, 6-10-7, 6-11-7, 6-13-7, 6-15-7, and 6-5-5-7) were trained and tested in order to find the relationship between the size of the neural network and the accuracy of the classification. Classification results are analysed using a range of scientific visualisation techniques. Two- and three-dimensional animations are used to observe the training processes and the behaviour of the neural network classifier.


Figure 1. Overfitting problem in curve fitting.

As can be seen from Figure 1, when the order of the polynomial function is increased, the curve fits the points very closely, which is called overfitting. At some point the curve fits exactly data points that carry some degree of error, and then totally loses its generalisation capabilities. While the second-order polynomial appears adequate to represent the points, the fourth-order polynomial gives especially poor results in the areas outside the points. The optimum choice for the order of the polynomial is related to generalisation capabilities. The same holds for artificial neural networks, in that large networks fit the boundaries too closely to the data points whereas small networks cannot even represent the data points.
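The curve-fitting analogy can be reproduced numerically. The data points below are illustrative assumptions (a noisy quadratic trend), not the values plotted in Figure 1: in-sample error necessarily falls as the polynomial order rises, while behaviour away from the data can deteriorate.

```python
import numpy as np

# Noisy samples of an underlying quadratic trend (illustrative values).
x = np.array([0., 1., 2., 3., 4., 5., 6.])
noise = np.array([0.3, -0.2, 0.4, -0.3, 0.1, -0.4, 0.2])
y = x**2 - 2.0 * x + noise

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)               # least-squares fit
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    extrap = np.polyval(coeffs, 8.0)                # behaviour outside the data
    print(degree, round(float(rmse), 3), round(float(extrap), 2))
```

Because the polynomial spaces are nested, the fitting error on the sample points can only decrease with the order, which is precisely why in-sample error alone cannot be used to choose the model size, for polynomials or for networks.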

3. Test Site and Data


Multi-sensor data (SIR-C SAR and SPOT HRV imagery) were used to identify land-cover classes for a study area of 76.6 km² located in a fairly flat agricultural area close to Thetford, Norfolk, in eastern England. Ground truth data were available from an earlier study. Quad-polarised L-band (~24 cm wavelength) SIR-C SAR data were acquired by NASA's SIR-C (Shuttle Imaging Radar-C) system on April 15, 1994. A SPOT HRV image, acquired for the same area on May 13, 1994, was also used. Radar images are highly susceptible to speckle effects, which can lead to confusion between classes. A 5 x 5 median filter was used to reduce the effects of speckle noise before the two image sets were geo-referenced to the Ordnance Survey of Great Britain's National Grid using an affine transformation. The RMSE values of the reference points chosen for the image transformation were less than one pixel. The land cover of the study area can be described in terms of seven land cover classes (sugar beet, wheat, peas, forest, winter barley, potatoes and linseed) covering the bulk of the area.
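A 5 x 5 median filter of the kind used for speckle suppression can be sketched as follows. This is a plain NumPy implementation with edge replication, and the test image is an illustrative flat region with a single speckle-like spike, not the SIR-C data.

```python
import numpy as np

def median_filter(img, size=5):
    """Apply a size x size median filter with edge replication, a common
    pre-processing step to suppress speckle noise in SAR imagery."""
    pad = size // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.median(padded[i:i + size, j:j + size])
    return out

# A flat region with one bright speckle spike: the median removes it.
img = np.full((7, 7), 10.0)
img[3, 3] = 200.0
print(median_filter(img)[3, 3])  # → 10.0
```

Unlike a mean filter, the median discards the outlying value entirely rather than smearing it into the neighbourhood, which is why it is preferred for speckle.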

4. Results
In order to investigate the effects of the network structure on performance, the training process and classification results have been analysed using scientific visualisation techniques, including 3-D representations of the data and animations. The training process was observed in three ways. Firstly, the trained networks were analysed using a graph representation of the network weights (Figure 2). This type of representation is very useful for observing the effects of learning on the network weights. The trained network and its results were saved every 150 iterations and converted to graph form, so that after 15,000 iterations 100 graphs had been produced. These graphs were saved individually as GIF images and then combined to produce a GIF animation, which helps in understanding the network training procedure and the behaviour of artificial neural networks.

Figure 2. Analysis of network weights in training process.

Secondly, a quantitative assessment based on the contingency matrix was carried out using bar graphs. The result of this assessment is portrayed in Figure 3. In these graphs the individual class accuracies, the overall accuracy and the number of unrecognised pixels are represented by bars, and the results were saved every 150 iterations. All the graphs were saved as GIF images and combined to form a GIF animation, which was found to be very useful in analysing the behaviour of neural networks.
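The accuracy figures derived from a contingency matrix can be computed as below. The matrix shown is a hypothetical 3-class example for illustration, not the paper's seven-class results.

```python
import numpy as np

def accuracies(cm):
    """Per-class and overall accuracy from a contingency (confusion) matrix
    with rows = reference classes and columns = predicted classes."""
    per_class = np.diag(cm) / cm.sum(axis=1)   # correct / reference total per class
    overall = np.trace(cm) / cm.sum()          # all correct / all pixels
    return per_class, overall

# Hypothetical 3-class matrix (not the paper's results).
cm = np.array([[50,  5,  5],
               [ 4, 40,  6],
               [ 2,  3, 45]])
per_class, overall = accuracies(cm)
print(per_class, overall)  # overall accuracy = 135/160 = 0.84375
```

Tracking these quantities every 150 iterations, as done for Figure 3, shows how individual class accuracies can diverge even while the overall accuracy climbs steadily.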

Figure 3. Accuracy assessment for test data set.


(sb: sugar beet, wh: wheat, pe: peas, fo: forest, wb: winter barley, po: potato, li: linseed, ov: overall).

Finally, the training processes for the six network structures were visualised in 3-D form, in which each pattern is represented by a marker and each land-cover class by a different colour. However, for the clarity of this presentation, class numbers have been used instead of different colours (Figure 4). In order to project the six-dimensional data (corresponding to the six image bands used) into three-dimensional space, Sammon's nonlinear mapping algorithm (Sammon 1969) was employed. Thus, it was possible to observe the learning process, and particularly the determination of decision boundaries in neural network learning. In this study the effect of network structure on learning was effectively observed by using six different network structures. GIF animations have been produced for each network structure using the above-mentioned techniques. Since it is not possible to include animations in this paper, they have been made available on the author's homepage (http://www.members.tripod.com/~kavzoglu/kavzoglu.htm).
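Sammon's mapping minimises a stress function that weights each pairwise-distance error by the inverse of the original distance, so that small inter-point distances are preserved preferentially. The sketch below uses plain gradient descent on this stress rather than the pseudo-Newton step of Sammon (1969), and the input patterns are randomly generated stand-ins for the six-band data.

```python
import numpy as np

def sammon(X, n_dims=3, n_iter=100, lr=0.1, seed=0):
    """Project X to n_dims dimensions by gradient descent on the Sammon stress
    E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij, with c = sum_{i<j} D_ij,
    where D are input-space and d are projected pairwise distances."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, 1.0)                      # dummy value, never used as a pair
    c = D[np.triu_indices(n, 1)].sum()
    Y = rng.normal(scale=0.1, size=(n, n_dims))   # random initial layout
    for _ in range(n_iter):
        d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
        d = np.maximum(d, 1e-9)                   # guard against coincident points
        ratio = (D - d) / (D * d)
        np.fill_diagonal(ratio, 0.0)
        # dE/dY_i = (-2/c) * sum_j ratio_ij * (Y_i - Y_j)
        grad = (-2.0 / c) * (ratio[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y

# Map ten hypothetical six-band patterns into three dimensions.
X = np.random.default_rng(1).normal(size=(10, 6))
Y = sammon(X)
print(Y.shape)  # → (10, 3)
```

The resulting 3-D coordinates can then be plotted with class labels, as in Figure 4, to watch decision boundaries form during training.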

Figure 4. Training process in 3-D form.

After all the networks were trained with the same training data, they were tested with the same test data set using contingency matrices. The results are given in Table 1. As can be seen, there is little difference between the accuracies and the numbers of unrecognised pixels for the networks used. In fact, the 6-10-7 network, which is the second smallest network considered, appeared to be the best choice. Since it is a small network, it has good generalisation capabilities.

Table 1. Results of six network structures for the test data set.
Network   Acc (%)   Unr
6-8-7     78.5      202
6-10-7    80.1      175
6-11-7    79.4      208
6-13-7    80.8      191
6-15-7    78.8      218
6-5-5-7   78.8      140

Acc: overall accuracy in percentage, Unr: number of unrecognised pixels.

5. Conclusion
In this study the determination of the optimum network structure for the classification of land-cover classes has been investigated. All the analyses carried out are based on a combination of visual and mathematical techniques. Five important conclusions can be drawn from the results:

- the magnitudes of the network weights increase most in the first part of the network (between the input and hidden layers) during the training stage;
- accuracy does not increase steadily as the network size is increased;
- large networks learn tasks more quickly, but not necessarily better;
- large networks do not give considerably better results: as stated by Wang et al. (1994), in feed-forward neural networks trained with gradient-type algorithms, as long as the network is large enough to learn the samples, the size of the network plays little role in the best generalisation performance of the network;
- scientific visualisation can provide valuable insights for understanding the behaviour of ANNs.

Perhaps the most important conclusion is that large networks do not always improve the accuracy of the classification. A network that is just large enough to learn the characteristics of the data is usually sufficient. However, several factors, such as the learning parameters, the number of iterations, the transfer function and the characteristics of the data, play very important roles in obtaining a network with high generalisation capabilities. Investigating the effects of these factors would be very useful for understanding the behaviour of artificial neural networks.

6. Acknowledgements
The author is supported by a grant from the Turkish Government. He wishes to express his deep thanks to Prof. Paul M. Mather, his supervisor, for his help and support throughout his continuing PhD study. The SIR-C SAR data were kindly made available by the NASA Jet Propulsion Laboratory, Pasadena, California, USA.

7. References
ATKINSON, P. M., and TATNALL, A. R. L., 1997, Neural networks in remote sensing. International Journal of Remote Sensing, 18, pp. 699-709.
BEBIS, G., and GEORGIOPOULOS, M., 1994, Feed-forward neural networks: why network size is so important. IEEE Potentials, October/November, pp. 27-31.
CHESTER, D. L., 1990, Why two hidden layers are better than one. Proceedings of the International Joint Conference on Neural Networks, Washington, USA, pp. 265-268.
CYBENKO, G., 1989, Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, pp. 303-314.
HIROSE, Y., YAMASHITA, K., and HIJIYA, S., 1991, Back-propagation algorithm which varies the number of hidden units. Neural Networks, 4, pp. 61-66.
KAVZOGLU, T., and MATHER, P. M., 1998, Assessing artificial neural network pruning algorithms. Proceedings of the 24th Annual Conference and Exhibition of the Remote Sensing Society (RSS'98), pp. 603-609.
LE CUN, Y., DENKER, J. S., and SOLLA, S. A., 1990, Optimal brain damage. In Advances in Neural Information Processing Systems 2, pp. 598-605 (San Mateo: Morgan Kaufmann).
LIPPMANN, R. P., 1987, An introduction to computing with neural nets. IEEE ASSP Magazine, April, pp. 4-22.
REED, R., 1993, Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4, pp. 740-747.
SAMMON, J. W., 1969, A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18, pp. 401-409.
WANG, C., VENKATESH, S. S., and JUDD, J. S., 1994, Optimal stopping and effective machine complexity in learning. In Advances in Neural Information Processing Systems 6, pp. 303-310 (San Mateo: Morgan Kaufmann).
