New Zealand Veterinary Association / Australian Veterinary Association Second Pan Pacific Veterinary Conference,
Christchurch, 23-28 June 1996; 83-105.
displays of the data. Geographic information systems can be used to produce maps and they
allow the exploration of spatial patterns in an interactive fashion.
Exploratory data analysis
Data exploration is aimed at developing hypotheses and makes extensive use of graphical
views of the data such as maps or scatter plots. Exploratory data analysis makes few
assumptions about the data and should be robust to extreme data values. Simple analytical
models can also be used in this analysis phase.
Models of spatial data
For this type of spatial data analysis specific hypotheses are formally tested or predictions are
made using statistical models of the data. Modeling of spatial phenomena has to incorporate
the possibility of spatial dependence in order to provide a true representation of the existing
effects. Such spatial effects can be either large scale trends or local effects. The first is also
called a first order effect and describes overall variation in the mean value of a parameter
such as rainfall. The second, called a second order effect, is produced by spatial
dependence and represents the tendency of neighboring values to follow each other in terms of
their deviation from the mean. This can, for example, be the case with the incidence of an
infectious animal disease affecting animals on farm properties. First order effects can be readily
modeled by standard regression models. The presence of second order effects violates the
independence assumption of standard statistical analysis techniques, and appropriate analysis
techniques will have to take account of the covariance structure in the data giving rise to these
local effects.
Often spatial data are modeled as stationary spatial processes, which assumes that while
there may be dependence between neighboring observations, it is independent of absolute
location. A stationary spatial process is isotropic if the covariance between
observations at different locations depends only on the distance between them and not on direction.
Non-stationary data are almost impossible to model, as most locations will require different parameter
sets. Therefore, most spatial modeling procedures begin by identifying a trend in the mean
value and then model the residuals from this trend as a stationary process.
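The two-step procedure described here (estimate a trend in the mean, then treat the residuals as the candidate stationary process) can be sketched as follows. This is a minimal illustration only, assuming a linear trend surface z = b0 + b1*x + b2*y fitted by ordinary least squares; the function names and the toy coordinates are hypothetical and are not taken from the analyses reported in this paper.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_trend(points, values):
    """Least-squares coefficients (b0, b1, b2) of z = b0 + b1*x + b2*y via the normal equations."""
    design = [[1.0, x, y] for x, y in points]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xtz = [sum(r[i] * z for r, z in zip(design, values)) for i in range(3)]
    return solve(xtx, xtz)

def detrend(points, values):
    """Residuals from the fitted first order (trend) surface."""
    b0, b1, b2 = fit_linear_trend(points, values)
    return [z - (b0 + b1 * x + b2 * y) for (x, y), z in zip(points, values)]

# Hypothetical data lying exactly on a plane, so the residuals should vanish.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
values = [2 + 3 * x - y for x, y in points]
coef = fit_linear_trend(points, values)
residuals = detrend(points, values)
```

In practice the residuals would then be examined for second order effects, for example with a variogram, rather than being assumed independent.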
With any of these models it has to be kept in mind that they are abstractions of reality, and
first or second order effects are artifacts of the modeler. Bailey and Gatrell (1995) conclude
that models can be at best 'not wrong', rather than 'right'. They add that the analyst should
always involve judgment and intuition in statistical modeling.
Problems in Spatial Data Analysis
A major factor influencing spatial data analysis is the geographical scale at which the data is
being analyzed. It may be possible to identify specific non-random patterns at a local level
which when looked at from a national level turn into random variations. Another problem can
be that many spatial data sets are based on irregularly shaped area units, or there may be
directional effects. Proximity or neighborhood may also be more difficult to define clearly than,
for example, in time-series analysis. Any type of spatial analysis will be subject to some degree
of edge effect, where area units on the map boundary have neighbors in one direction only.
Many data analyses have to be conducted with observations based on information summarized
at a particular spatial aggregation level, such as the veterinary district. Inferences from such
analyses may only be correct if used at the same level of aggregation. This situation has also
been called the modifiable areal unit problem.
Figure 1: Dot maps of the locations of herds from a case-control study of tuberculosis breakdown in New Zealand cattle
herds (panel title: Case Herds)
Lawson 1995). The latter tests relate to the clustering of events around fixed point locations,
such as a nuclear power plant. Wartenberg and Greenberg (1990) describe techniques
for detection of hot spot clusters and clinal clusters.
A tool which can be effectively used for the analysis of clustering effects is the K function
(Kingham, Gatrell, and Rowlingson 1995). In this context, two classes of point processes such
as cases of disease and random controls without the disease are compared. The principle is that
both point processes are pooled and then the point process describing the cases is compared
with the pooled process. Cuzick and Edwards (1990) developed a method which is also based
on nearest-neighbor distances. The test statistic counts, for a given number k of nearest
neighbors, how many of the nearest neighbors of each case are themselves cases. Applying this
technique to the case-control data mentioned above, it appears that there is significant
clustering of cases compared with the control population (see Figure 2). The software Stat!
(BioMedware, Ann Arbor, Michigan, U.S.A.) was used to run the analysis and produce the graph.
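The nearest-neighbor idea can be illustrated with a short sketch. It is not the Stat! implementation; it simply counts, over all cases, how many of the k nearest neighbors are also cases, and assesses the count by randomly relabelling cases and controls. The function names, the tie-breaking rule (by index) and the toy coordinates are assumptions made for illustration.

```python
import math
import random

def nearest_neighbours(points, k):
    """For each point, the indices of its k nearest neighbours (ties broken by index)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        result.append(others[:k])
    return result

def case_neighbour_count(neighbours, is_case):
    """Sum, over all cases, of the number of cases among their k nearest neighbours."""
    return sum(is_case[j]
               for i, near in enumerate(neighbours) if is_case[i]
               for j in near)

def permutation_test(points, is_case, k, n_perm=999, seed=1):
    """Compare the observed count with counts under random relabelling of cases and controls."""
    rng = random.Random(seed)
    neighbours = nearest_neighbours(points, k)
    observed = case_neighbour_count(neighbours, is_case)
    labels = list(is_case)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if case_neighbour_count(neighbours, labels) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical herd locations: four clustered cases, five controls elsewhere.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
is_case = [1, 1, 1, 1, 0, 0, 0, 0, 0]
observed, p_value = permutation_test(points, is_case, k=1)
```

Because only the labels are permuted, the neighbor lists are computed once, and the locations of the pooled case and control population are held fixed under the null hypothesis.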
Figure 2: Cuzick and Edwards method applied to tuberculosis breakdown case control study data (+ = cases, =
controls, arrows identify nearest neighbors)
Kingham, Gatrell, and Rowlingson (1995) describe a method combining the Diggle and
Chetwynd method based on bivariate K functions with results of a logistic regression analysis
allowing them to make use of information about additional covariates to test for clustering.
Methods aimed at the detection of hot spots or small areas which might represent clusters
of disease include the Geographical Analysis Machine developed by Openshaw (1990). This
technique is based on comparing the observed intensity of cases in circles of varying radius.
The result of the analysis is a map with circles indicating the areas where case incidence was
higher than expected under the assumption of spatial randomness. Alexander and Cuzick
(1992) reviewed methods for assessment of disease clusters. Wartenberg and Greenberg
(1993) discuss problems associated with detection of disease clusters.
Figure 3: Map and histogram of geographical distances between cases of tuberculosis infection in wild possums
Figure 4: Temporal distance map and histogram for cases of tuberculosis infection in wild possums
As a next step, a formal statistical test has to be conducted to assess the statistical significance
of a potential space-time interaction process. When using Knox's method, a critical distance in
time as well as in space defining closeness has to be set, and pairs of cases are tabulated into a
2x2 contingency table with spatial and temporal closeness/farness defining the rows and
columns (Knox 1964). Knox saw the critical distance as defining the latency period. In most
situations, determining the critical distance requires a subjective decision. Approximate
randomization permutation techniques are used to construct a null distribution for Knox's test
statistic. Figure 5 shows the results of Knox's test applied to the tuberculous possum data
using a critical distance of 100 m in space and 3 months in time. The value of 30 for Knox's
test statistic X, which is significant at a p-value of 0.02, suggests that, given the selected critical
distances, time-space interaction is present in this data set.
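A minimal sketch of the procedure just described is given below. It builds the 2x2 table of case pairs, takes the close-close cell as the test statistic X, and constructs a null distribution by randomly reassigning the observation times to the locations. The function names and the toy data (coordinates in metres, times in months) are hypothetical and are not the possum data.

```python
import math
import random
from itertools import combinations

def knox_table(coords, times, d_crit, t_crit):
    """2x2 table of case pairs: rows = close/far in space, columns = close/far in time."""
    table = [[0, 0], [0, 0]]
    for i, j in combinations(range(len(coords)), 2):
        far_space = int(math.dist(coords[i], coords[j]) > d_crit)
        far_time = int(abs(times[i] - times[j]) > t_crit)
        table[far_space][far_time] += 1
    return table

def knox_test(coords, times, d_crit, t_crit, n_perm=999, seed=1):
    """Knox statistic X (pairs close in both space and time) with a permutation p-value."""
    rng = random.Random(seed)
    observed = knox_table(coords, times, d_crit, t_crit)[0][0]
    shuffled = list(times)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between where and when
        if knox_table(coords, shuffled, d_crit, t_crit)[0][0] >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical cases: three close together in space and time, three scattered.
coords = [(0, 0), (50, 0), (0, 60), (1000, 1000), (2000, 0), (0, 3000)]
times = [0, 1, 2, 10, 20, 30]
table = knox_table(coords, times, d_crit=100, t_crit=3)
observed, p_value = knox_test(coords, times, d_crit=100, t_crit=3)
```

The result depends directly on the two critical distances passed in, which mirrors the subjective choice noted in the text.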
Figure 5: Results of Knox's method applied to cases of tuberculosis infection in wild possums
Another approach to investigating time-space interaction is the Mantel method
(Mantel 1967). Mantel's approach does not require selection of critical distances. It uses both
the time and the space distance matrices between all cases. It should be kept in mind, however,
that the Mantel test can be insensitive to non-linear associations between time and space distances.
Distance measures can be transformed in a number of ways, such as the reciprocal
transformation, which reduces the effect of large time and space distances. The null hypothesis
is that the time distances are independent of the space distances. Randomization permutation
techniques can be used to generate a reference distribution for the Mantel test statistic. Figure 6
presents the results of this analysis when applied to the data on tuberculous possums. The scatter
plot of space distances against temporal distances seems to suggest that, while the points are
scattered throughout the plot, there are some denser accumulations of cases present. The frequency
distribution of the test statistic under the null hypothesis on the basis of 500 random
permutations is presented in the left window of Figure 6. It can be concluded that there is
significant space-time interaction.
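The Mantel approach can be sketched in a few lines. The version below uses the unnormalised sum of products of space and time distances over all case pairs as the statistic and permutes the times to obtain a reference distribution. The `transform` argument is a hypothetical hook for transformations such as the reciprocal; a form like 1/(d + c) with a positive constant c would be needed to avoid division by zero, and the choice of c is an assumption, not something specified in the text.

```python
import math
import random
from itertools import combinations

def mantel_test(coords, times, n_perm=499, seed=1, transform=lambda d: d):
    """Sum of products of (transformed) space and time distances, with a permutation p-value."""
    pairs = list(combinations(range(len(coords)), 2))
    space = [transform(math.dist(coords[i], coords[j])) for i, j in pairs]

    def product_sum(ts):
        return sum(s * transform(abs(ts[i] - ts[j]))
                   for s, (i, j) in zip(space, pairs))

    rng = random.Random(seed)
    observed = product_sum(times)
    shuffled = list(times)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign times to locations at random
        if product_sum(shuffled) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical cases along a transect, occurring in order of position.
coords = [(float(x), 0.0) for x in range(6)]
times = [0, 1, 2, 3, 4, 5]
observed, p_value = mantel_test(coords, times)
```

The upper tail is used here because, with untransformed or reciprocally transformed distances, space-time interaction makes pairs that are close (or far) in space also close (or far) in time, which increases the sum of products.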
Figure 6: Results from applying the Mantel method to test for time-space interaction between cases of tuberculosis
infection in wild possums
A third approach available is the K-nearest neighbor test of space-time interaction in point
data. The test statistic indicates the number of case pairs which are K nearest neighbors in both
time and space. The statistic is based on an approximate randomization of the Mantel product
statistic. Figure 7 presents the results from applying the K-nearest neighbor method to the
possum tuberculosis data. The map shows the locations of the cases and the arrows indicate
k=2 nearest neighbors. The test statistic produced on the basis of 1000 random permutations
suggests that only the cumulative statistic Jk is statistically significant, whereas ΔJk is not. The
latter parameter measures the statistical significance of increasing K by 1. The test statistic
supports the presence of space-time interaction, and suggests that the first 5 nearest neighbors
are involved in space-time interaction.
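A statistic of this kind can be sketched as follows. The code counts ordered pairs (i, j) in which case j is among the k nearest neighbors of case i in both space and time, and takes the change between successive values of k as the increment. This is an illustrative reading of the description above rather than the implementation used for Figure 7; the function names, the tie-breaking rule (by index) and the toy data are assumptions.

```python
import math
import random

def knn_sets(n, distance, k):
    """For each case i, the set of its k nearest neighbours under distance(i, j)."""
    sets = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (distance(i, j), j))
        sets.append(set(others[:k]))
    return sets

def joint_neighbours(coords, times, k):
    """Cumulative statistic: pairs that are k nearest neighbours in space and in time."""
    n = len(coords)
    in_space = knn_sets(n, lambda i, j: math.dist(coords[i], coords[j]), k)
    in_time = knn_sets(n, lambda i, j: abs(times[i] - times[j]), k)
    return sum(len(in_space[i] & in_time[i]) for i in range(n))

def knn_test(coords, times, k, n_perm=999, seed=1):
    """Observed statistic and permutation p-value obtained by shuffling the times."""
    rng = random.Random(seed)
    observed = joint_neighbours(coords, times, k)
    shuffled = list(times)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if joint_neighbours(coords, shuffled, k) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical cases forming three tight space-time pairs.
coords = [(0, 0), (1, 0), (5, 0), (6, 0), (20, 0), (21, 0)]
times = [0, 1, 10, 11, 30, 31]
j1, p1 = knn_test(coords, times, k=1)
j2, p2 = knn_test(coords, times, k=2)
delta_j2 = j2 - j1  # contribution of moving from k=1 to k=2
```

Unlike the Knox test, no critical distances are needed; only the rank order of distances matters.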
Figure 7: Results from applying the K-nearest neighbor method to test for time-space interaction between cases of
tuberculosis infection in wild possums
Figure 8: TIN and derived contour maps and digital terrain model for the possum tuberculosis longitudinal study site
(panel titles: TIN, Contour map, DTM)
As with point patterns, it is also possible to use kernel estimation to convert the attribute data
from the sampling points into a surface, this time using not the number of events per unit area
but rather the value of the attribute. This technique has been used in geographical
epidemiology to model the relative risk function, measuring local risk relative to the regional
mean (Bithell 1990).
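One common way of turning attribute values at sample points into a surface is a kernel-weighted average, sketched below. The Gaussian kernel, the bandwidth and the function name are assumptions made for illustration; the text does not specify which kernel or bandwidth was used.

```python
import math

def kernel_smooth(sample_points, values, target, bandwidth):
    """Kernel-weighted mean of the attribute values around the target location.

    Uses a Gaussian kernel (an assumption); the bandwidth controls the smoothing.
    Raises ZeroDivisionError if every sample point is so far away that all
    weights underflow to zero.
    """
    weights = [math.exp(-0.5 * (math.dist(p, target) / bandwidth) ** 2)
               for p in sample_points]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical trap sites with an attribute value (e.g. a proportion in percent).
sites = [(0.0, 0.0), (2.0, 0.0)]
attribute = [0.0, 10.0]
midpoint_estimate = kernel_smooth(sites, attribute, (1.0, 0.0), bandwidth=1.0)
near_first_site = kernel_smooth(sites, attribute, (0.1, 0.0), bandwidth=1.0)
```

Evaluating the function on a regular grid of target locations would give the surface; a larger bandwidth gives a smoother surface closer to the overall mean.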
Exploratory analysis of second order effects in spatially continuous data
Spatial dependence between attribute values measured at sampled locations is described using
the covariance function or covariogram. The presence of second order effects would result in
positive covariance between observations a small distance apart and lower covariance or
correlation if they are further apart. The covariogram describes the function of the covariance
for varying distances h between sample points and the correlogram the corresponding
correlation. The semi-variogram is a graphical representation of the variation between
sampling points separated by a given distance and direction. For a stationary spatial process all
three describe similar information. Estimates of the semi-variogram are considered to be more
robust to departures from stationarity represented as a general trend in the spatial process. A
continuous process without spatial dependence will result in a horizontal line. A stationary
process will reach an upper bound, referred to as the sill, at a distance h called the range.
Theoretically, the intercept with the y-axis should be at a value of zero. In reality,
sampling error and small scale variation will result in variability at small distances, and the
variogram will not meet the y-axis at the origin. This intercept with the y-axis is called the
nugget effect. Variograms which do not reach an upper bound suggest non-stationarity. Figure
9 shows an isotropic sample semi-variogram for the proportion of tuberculous possums
captured at trap sites during the longitudinal study. The shape of the variogram suggests that
the process is non-stationary, but given the relatively small nugget value there is also likely to
be spatial dependence.
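An isotropic sample semi-variogram can be computed as half the mean squared difference between values at pairs of points falling into each distance class. The sketch below assumes equal-width lag classes and hypothetical function names; the toy values follow a pure linear trend along a transect, so the estimated semi-variance keeps rising with distance, the pattern described above as suggesting non-stationarity.

```python
import math
from itertools import combinations

def sample_semivariogram(points, values, lag_width, n_lags):
    """Return (mean distance, semi-variance, number of pairs) per non-empty lag class."""
    dist_sum = [0.0] * n_lags
    sq_sum = [0.0] * n_lags
    count = [0] * n_lags
    for i, j in combinations(range(len(points)), 2):
        h = math.dist(points[i], points[j])
        lag = int(h // lag_width)  # index of the distance class containing h
        if lag < n_lags:
            dist_sum[lag] += h
            sq_sum[lag] += (values[i] - values[j]) ** 2
            count[lag] += 1
    return [(dist_sum[b] / count[b], sq_sum[b] / (2 * count[b]), count[b])
            for b in range(n_lags) if count[b] > 0]

# Hypothetical transect with a linear trend in the attribute.
points = [(float(x), 0.0) for x in range(5)]
trend_values = [0.0, 1.0, 2.0, 3.0, 4.0]
variogram = sample_semivariogram(points, trend_values, lag_width=1.0, n_lags=5)
```

The pair counts are returned alongside the estimates because semi-variances at the largest lags are typically based on few pairs and are correspondingly unreliable.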
Example of a variogram
Figure 9: Isotropic semi-variogram for the proportion of tuberculous possums at individual trap sites in the longitudinal
study
Figure 10: Omnidirectional exponential variogram model for possum tuberculosis prevalence data
The variogram model itself does not allow prediction of values. This can be achieved with
Kriging. This is a weighted moving average technique for estimating the value of a spatially
distributed variable from adjacent values while considering interdependence expressed in a
variogram. It allows the interpolation error to be mapped and from a statistical viewpoint is
considered to be the most satisfactory method for interpolation (Oliver and Webster 1990).
Pfeiffer (1994) used ordinary Kriging to produce a surface of possum population density
based on possum capture data at sample points (see Figure 11). The omnidirectional
variogram suggests that this data is more stationary than the tuberculosis prevalence data, but
it also shows strong spatial dependence. An exponential model was fitted and used as the basis
for Kriging. The distribution of Kriging errors shows that there are some reasonably high
errors and according to the map they are located in one particular area of the study.
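The link between a fitted variogram model and Kriging can be illustrated with a small ordinary Kriging sketch. It assumes an exponential model of the form nugget + partial_sill * (1 - exp(-h / range_par)) for h > 0 (other parametrisations exist), solves the ordinary Kriging equations with a Lagrange multiplier, and returns both the prediction and the Kriging variance. Parameter values, function names and data are hypothetical and are not those used for Figure 11.

```python
import math

def exp_variogram(h, nugget, partial_sill, range_par):
    """Exponential variogram model (assumed parametrisation); zero at h = 0."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_par))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ordinary_kriging(points, values, target, gamma):
    """Prediction and Kriging variance at target, given a variogram function gamma(h)."""
    n = len(points)
    a = [[gamma(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])  # constraint: weights sum to one
    b = [gamma(math.dist(p, target)) for p in points] + [1.0]
    solution = solve(a, b)
    weights, multiplier = solution[:n], solution[n]
    prediction = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + multiplier
    return prediction, variance

gamma = lambda h: exp_variogram(h, nugget=0.0, partial_sill=1.0, range_par=2.0)
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
values = [1.0, 3.0, 2.0, 6.0]
at_centre = ordinary_kriging(points, values, (1.0, 1.0), gamma)
at_sample = ordinary_kriging(points, values, (0.0, 0.0), gamma)
```

With a zero nugget the predictor reproduces the observed value at a sample point with zero variance, while the variance grows away from the data; mapping that variance is what allows the interpolation error to be displayed.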
A number of multivariate methods can be used for modeling of spatially continuous data.
Principal components can be used to combine the information from multiple variables into a
small number of components, each of them representing a particular combination of variables
and explaining a particular proportion of the variation in the data. Eastman and Fulk (1993)
used the technique to analyze the information contained in a time series of NDVI maps for
Africa, thereby conducting a space-time analysis. Cliff et al. (1995) discuss the application of
multidimensional scaling (MDS) to spatial epidemiological data. They use the technique to
map geographical information about measles mortality in Australia and New Zealand as
disease space where points with similar disease risks are closer to each other on the MDS map
even though they are far removed geographically. Bailey and Gatrell (1995) discuss a range of
other multivariate analysis techniques for spatially continuous data.
Area data
Attribute data which have values within fixed polygonal zones within a study area are
referred to as area data or lattice data. The areal units can constitute a regular lattice or grid
or consist of irregular units. It is usually not required to estimate values as they should be
present for all areas. The main emphasis with area data is on detection and explanation of
spatial patterns or trends possibly extended to take account of covariates.
at different spatial lags. If the autocorrelation does not decline after a number of lags, it
indicates the presence of non-stationarity. The correlogram has similar applications in spatial
analysis as it has in time-series analysis for describing patterns. Hungerford (1991) analyzed the
spatial distribution of cattle anaplasmosis between counties within the state of Illinois using
second-order analysis and detected significant spatial clustering within the state.
The above mentioned methods do not provide local indicators of spatial association which
would be useful for identifying so-called hot spots. The Moran scatterplot and spatial lag pies
described in Anselin (1994) can be used to describe local patterns of variation visually.
Quantitative estimates can be obtained using the G statistic by Getis and Ord (1992) or the
local indicators of spatial association by Anselin (1995). The latter can be used as an indicator
of local pockets of non-stationarity (hot spots), similar to the G statistic, and also to assess the
influence of individual data points on the global statistic and to identify outliers. Anselin,
Dodson, and Hudak (1993) describe how these different techniques can be combined to form an
exploratory spatial analysis system. Figure 13 shows a number of examples used by these
authors to display local variation (from Anselin's world wide web site
http://lambik2.rri.wvu.edu/esda.htm). The spatial lag pie map superimposes a pie on each area
with the top half of the pie representing the local value and the bottom part the neighboring
values for this particular variable. It gives the observer an appreciation of the ratio between the
local value and the surrounding spatial units. The Moran scatterplot shows the original value
of each observation on the x-axis and the value of its spatial lag on the y-axis. The plot can be
used to identify outliers or even to conduct local regression to further describe the spatial
association. These outliers can then be mapped as shown in Figure 13. The map of the areas
with significant LISA statistic indicates the area where there appears to be spatial
autocorrelation.
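The quantities behind the Moran scatterplot and the local indicators can be sketched as follows. The code assumes row-standardised weights built from a simple adjacency list, in which case the global Moran's I equals the mean of the local values and also the slope of the regression of the spatial lag on the original deviations. The function name and the toy chain of four areas are hypothetical, and no significance testing of the local statistics is attempted here.

```python
def moran_statistics(values, neighbours):
    """Global Moran's I, local Moran values and Moran scatterplot coordinates.

    neighbours[i] lists the indices of the areas adjacent to area i; every area
    is assumed to have at least one neighbour. Weights are row-standardised.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    m2 = sum(d * d for d in z) / n                      # second moment
    lag = [sum(z[j] for j in neighbours[i]) / len(neighbours[i])
           for i in range(n)]                           # spatial lag of each area
    local = [z[i] * lag[i] / m2 for i in range(n)]      # local Moran values
    global_i = sum(local) / n
    scatter = list(zip(z, lag))                         # x = value, y = spatial lag
    return global_i, local, scatter

# Hypothetical chain of four areas: two low values next to two high values.
values = [1.0, 1.0, 3.0, 3.0]
neighbours = [[1], [0, 2], [1, 3], [2]]
global_i, local, scatter = moran_statistics(values, neighbours)
slope = (sum(x * y for x, y in scatter) / sum(x * x for x, _ in scatter))
```

Areas whose scatterplot point lies far from the fitted line, or whose local value has the opposite sign to the global statistic, are candidates for the outliers and local pockets discussed above.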
Moran scatterplot
In landscape ecology, approaches have been developed to describe the interactions among
patches within a landscape mosaic, referred to as landscape pattern. Most biological processes,
and that of course includes diseases, are influenced by a multitude of factors which together
may form a particular pattern. Spatial patterns are particularly difficult to quantify. Ecologists
use the term landscape structure which describes the spatial relationships between habitat
patches within a landscape (Dunning, Danielson, and Leck 1992). The software FRAGSTATS
(McGarigal and Marks, Oregon State University, Corvallis, Oregon, U.S.A.) allows calculation
of a wide range of indices and parameters describing landscape structure which could be used
for further analyses.
Modeling of area data
Modeling techniques are aimed at establishing explanatory relationships between attribute
values of a dependent variable, taking account of the relative spatial arrangement of the areas
and other values associated with each area unit. Again, it is possible to focus the analysis on
first order or second order effects. Multiple ordinary least squares regression can only be
used for preliminary exploratory analyses, as it suffers from the problem that in the presence of
spatial dependence the errors are not independent and the variance is unlikely to be
constant. The presence of spatial dependence can be assessed readily using a spatial
correlogram. A range of spatial regression models has been described by Haining (1990), and
they can be implemented using the SpaceStat software mentioned above. Hungerford (1991)
analyzed the relationship between cattle density and anaplasmosis prevalence on a county basis
in Illinois using measures of spatial correlation. Perry et al. (1991) used a GIS to investigate
the occurrence of Rhipicephalus appendiculatus in Africa to identify the factors controlling the
distribution of the vector tick which transmits the parasite Theileria parva causing East Coast
fever, Corridor disease and January disease in cattle. A number of authors have included spatial
data in multivariate analyses as independent variables. Clifton-Hadley (1993) used spatial
descriptive measures, spatial autocorrelation and distance to particular features of interest to
analyse patterns of occurrence of badger-related tuberculosis breakdowns of cattle herds in
south-west England. Pfeiffer (1994) used a GIS to derive specific geographical variables for
point locations (cases of disease), such as height above sea level, aspect, slope and
distance to features of interest, which were then used as explanatory variables in multivariate
statistical analysis.
In the field of epidemiology, parameters of interest are very often counts or proportions
which can be modeled using generalised linear modeling techniques rather than ordinary
least-squares regression. It should be noted though that spatial forms of these models are not
well developed yet. Bailey and Gatrell (1995) suggest introducing covariates into the
regression model such as the spatial coordinates or a variable representing regions categorized
broadly by location to remove the effect of spatial dependence. Glass et al. (1995) developed a
risk density map for Lyme disease based on a multiple logistic regression model, but they did
not attempt to remove spatial dependence from the data. A number of different predictive
modeling approaches for spatial data were compared by Williams et al. (1994). They used linear
and non-linear discriminant analysis, tree-based induction and neural networks to map tsetse
distributions in Zimbabwe and concluded that while the simpler methods (linear discriminant
analysis and tree-based induction) were less precise, they were easier to interpret. Figure 14
presents some preliminary results of a logistic regression analysis for prediction of Theileria
parva presence in an African country (this analysis was conducted by B. D. Perry, R. L. Kruska,
D. U. Pfeiffer and others at ILRI, Nairobi, Kenya). The regression model includes eight
different environmental and land use variables and is based on information collected at random
sample locations throughout the country. The model was used to generate a risk map
representing the probability of T. parva presence at a particular location given a number of risk
factors included in the model. This map is presented as a DTM and as a raster map. In
addition, two further raster maps are shown which display the lower and upper 95%
confidence limits of T. parva presence as predicted by the regression model. The receiver
operating characteristic curve (ROC) characterizing the predictive accuracy of the model could
be used to adjust the decision making cut-off for the prediction probability balancing sensitivity
and specificity as required. In this analysis the possible presence of spatial dependence was not
taken into account.
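The use of a cut-off on the predicted probability, and the resulting trade-off between sensitivity and specificity, can be sketched as follows. The coefficients, predicted probabilities and presence records below are invented for illustration and have no connection with the T. parva model; the function names are likewise hypothetical.

```python
import math

def predicted_probability(intercept, coefficients, covariates):
    """Logistic regression prediction: inverse logit of the linear predictor."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-linear))

def sensitivity_specificity(probabilities, present, cutoff):
    """Classify a location as positive if its probability is at or above the cut-off."""
    tp = sum(1 for p, y in zip(probabilities, present) if y and p >= cutoff)
    fn = sum(1 for p, y in zip(probabilities, present) if y and p < cutoff)
    tn = sum(1 for p, y in zip(probabilities, present) if not y and p < cutoff)
    fp = sum(1 for p, y in zip(probabilities, present) if not y and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(probabilities, present):
    """(cut-off, sensitivity, specificity) for every distinct predicted probability."""
    return [(c,) + sensitivity_specificity(probabilities, present, c)
            for c in sorted(set(probabilities))]

# Hypothetical predictions at four sample locations and the observed presence.
probabilities = [0.1, 0.4, 0.35, 0.8]
present = [0, 0, 1, 1]
strict = sensitivity_specificity(probabilities, present, cutoff=0.5)
lenient = sensitivity_specificity(probabilities, present, cutoff=0.3)
curve = roc_points(probabilities, present)
baseline = predicted_probability(0.0, [1.0], [0.0])
```

Lowering the cut-off raises sensitivity at the cost of specificity, so the choice depends on the relative cost of missing infected locations against flagging uninfected ones.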
Figure 14: Results of a multiple logistic regression analysis for prediction of Theileria parva presence
(recoverable labels: Sensitivity axis, 0-100; Sampling location)
References
Alexander, F. E., and J. Cuzick. 1992. Methods for the assessment of disease clusters.
Geographical and environmental Epidemiology: Methods for small-area Studies. Editors
P. Elliott, J. Cuzick, D. English, and R. Stern, 238-50. 382 pp. Oxford: Oxford University
Press.
Anselin, L. 1994. Exploratory spatial data analysis and geographic information systems. New
Tools for spatial Analysis. Editor M. Painho, 45-54. Luxembourg: Eurostat.
Anselin, L. 1995. Local indicators of spatial association - LISA. Geographical Analysis 27, no.
2: 93-115.
Anselin, L. 1992. Spatial data analysis with GIS: An introduction to application in the social
Sciences. 75 pp. Technical Report Series. Santa Barbara, California: National Center for
Geographic Information and Analysis.
Anselin, L., R. F. Dodson, and S. Hudak. 1993. Linking GIS and spatial data analysis in
practice. Geographical Analysis 1: 3-23.
Bailey, T. C., and A. C. Gatrell. 1995. Interactive spatial data analysis. Harlow, Essex, England:
Longman Group. 413 pp.
Bithell, J. F. 1990. An application of density estimation to geographical epidemiology.
Statistics in Medicine 9: 691-701.
Cliff, A. D., P. Haggett, M. R. Smallman-Raynor, D. F. Stroup, and G. D. Williamson. 1995.
The application of multidimensional scaling methods to epidemiological data. Statistical
Methods in Medical Research 4: 102-23.
Clifton-Hadley, R. S. 1993. The use of a geographical information system (GIS) in the control
and epidemiology of bovine tuberculosis in south-west England. Proceedings of the Society
for Veterinary Epidemiology and Preventive Medicine, editor M. V. Thrusfield, 166-79.
Society for Veterinary Epidemiology and Preventive Medicine.
Cuzick, J., and R. Edwards. 1990. Spatial clustering for inhomogeneous populations. Journal
of the Royal Statistical Society B 52, no. 1: 73-104.
Dunning, J. B., B. J. Danielson, and C. F. Leck. 1992. Ecological processes that affect
populations in complex landscapes. Oikos 65: 169-75.
Eastman, J. R., and M. Fulk. 1993. Long sequence time series evaluation using standardized
principal components. Photogrammetric Engineering and Remote Sensing 59, no. 6: 991-96.
Eastman, J. R., W. Jin, P. A. K. Kyem, and J. Toledano. 1995. Raster procedures for
multi-criteria/multi-objective decisions. Photogrammetric Engineering and Remote Sensing 61,
no. 5: 539-47.
Elliott, P., M. Martuzzi, and G. Shaddick. 1995. Spatial statistical methods in environmental
epidemiology: a critique. Statistical Methods in Medical Research 4: 137-59.
Getis, A., and J. K. Ord. 1992. The analysis of spatial association by use of distance statistics.
Geographical Analysis 24 (3): 189-206.
Glass, G. E., B. S. Schwartz, J. M. Morgan, D. T. Johnson, P. M. Noy, and E. Israel. 1995.
Environmental risk factors for Lyme disease identified with geographic information systems.
American Journal of Public Health 85, no. 7: 944-48.
Haining, R. 1990. Spatial Data Analysis in the social and environmental Sciences. Cambridge:
Cambridge University Press.
Hungerford, L. L. 1991. Use of spatial statistics to identify and test significance in geographic
disease patterns. Preventive Veterinary Medicine 11: 237-42.
Izenman, A. J. 1991. Recent developments in nonparametric density estimation. Journal of the
American Statistical Association 86 (413): 205-24.
Kingham, S. P., A. C. Gatrell, and B. Rowlingson. 1995. Testing for clustering of health events
within a geographical information system framework. Environment and Planning A 27:
809-21.
Knox, E. G. 1964. The detection of space-time interaction. Applied Statistics. 13: 25-29.
Lessard, P., R. L'Eplattenier, R. A. I. Norval, K. Kundert, T. T. Dolan, H. Croze, and others.
1990. Geographical information systems for studying the epidemiology of cattle diseases
caused by Theileria parva. Veterinary Record 126: 255-62.
Mantel, N. 1967. The detection of disease clustering and a generalized regression approach.
Cancer Research. 27 (2): 209-20.
Morris, R. S., R. L. Sanson, and M. W. Stern. 1992. EPIMAN - A Decision Support System for
Managing a Foot-and-Mouth Disease Epidemic. Proceedings Fifth Annual Meeting of the
Dutch Society for Veterinary Epidemiology and Economy, Wageningen, 1-35.
Oliver, M. A., and R. Webster. 1990. Kriging: a method of interpolation for geographical
information systems. International Journal of Geographical Information Systems 4 (3):
313-32.
Openshaw, S. 1990. Automating the search for cancer clusters: a review of problems,
progress, and opportunities. Spatial epidemiology. Editor R.W. Thomas, 48-78. Pion
Publications.
Perry, B. D., R. Kruska, P. Lessard, R. A. I. Norval, and K. Kundert. 1991. Estimating the
distribution and abundance of Rhipicephalus appendiculatus in Africa. Preventive
Veterinary Medicine 11: 261-68.
Pfeiffer, D. U. 1994. The role of a wildlife reservoir in the epidemiology of bovine
tuberculosis. Unpublished PhD Thesis, Massey University, Palmerston North, New Zealand.
Rothman, K. J. 1990. A sobering start for the cluster busters' conference. American Journal of
Epidemiology 132 Sup 1: S6-S13.
Sanson, R.L. 1993. The development of a decision support system for an animal disease
emergency. Unpublished PhD Thesis. Massey University, Palmerston North, New Zealand.
Waller, L. A., and A. B. Lawson. 1995. The power of focused tests to detect disease
clustering. Statistics in Medicine 14: 2291-308.
Walter, S. D. 1993. Assessing spatial patterns in disease rates. Statistics in Medicine 12: 1885-94.
Wartenberg, D., and M. Greenberg. 1990. Detecting disease clusters: the importance of
statistical power. American Journal of Epidemiology 132 Sup 1: S156-S166.
Wartenberg, D., and M. Greenberg. 1993. Solving the cluster puzzle: Clues to follow and
pitfalls to avoid. Statistics in Medicine 12: 1763-70.
Williams, B., D. Rogers, G. Staton, B. Ripley, and T. Booth. 1994. Statistical modelling of
georeferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetation
data. Modelling vector-borne and other parasitic Diseases. Editors B. D. Perry, and J. W.
Hansen, 267-80. 369 pp. Nairobi, Kenya: The International Laboratory for Research on
Animal Diseases.